The dataset contains the following features:
Note: You can assume that the data is collected in the year 2016.
This notebook can be considered a guide to refer to while solving the problem. The evaluation will be as per the Rubric shared for the Milestone. Unlike previous courses, it does not follow the pattern of the graded questions in different sections. This notebook will give you a direction on what steps need to be taken in order to get a viable solution to the problem. Please note that this is just one way of doing this. There can be other 'creative' ways to solve the problem and we urge you to feel free and explore them as an 'optional' exercise.
In the notebook, there are markdown cells called - Observations and Insights. It is a good practice to provide observations and extract insights from the outputs.
The naming convention for different variables can vary. Please consider the code provided in this notebook as a sample code.
All the outputs in the notebook are just for reference and can be different if you follow a different approach.
There are sections called Think About It in the notebook that will help you get a better understanding of the reasoning behind a particular technique/step. Interested learners can take alternative approaches if they wish to explore different techniques.
!pip install yellowbrick
# Libraries to help with reading and manipulating data
import numpy as np
import pandas as pd
# Libraries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns
# To scale the data using z-score
from sklearn.preprocessing import StandardScaler
# To compute distances
from scipy.spatial.distance import cdist
# To perform K-means clustering and compute Silhouette scores
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
# To visualize the elbow curve and Silhouette scores
from yellowbrick.cluster import SilhouetteVisualizer
# Importing PCA
from sklearn.decomposition import PCA
# To encode the variable
from sklearn.preprocessing import LabelEncoder
# Importing TSNE
from sklearn.manifold import TSNE
# To perform hierarchical clustering, compute cophenetic correlation, and create dendrograms
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram, linkage, cophenet
# To compute distances
from scipy.spatial.distance import pdist
# To import K-Medoids
from sklearn_extra.cluster import KMedoids
# To import Gaussian Mixture
from sklearn.mixture import GaussianMixture
# To supress warnings
import warnings
warnings.filterwarnings("ignore")
# Importing DBSCAN
from sklearn.cluster import DBSCAN
# loading the dataset
data = pd.read_csv("marketing_campaign.csv")
# Print the shape of the data
data.shape
(2240, 27)
There are 2,240 unique observations in the data and 27 columns.
# View first 5 rows
data.head()
| ID | Year_Birth | Education | Marital_Status | Income | Kidhome | Teenhome | Dt_Customer | Recency | MntWines | ... | NumCatalogPurchases | NumStorePurchases | NumWebVisitsMonth | AcceptedCmp3 | AcceptedCmp4 | AcceptedCmp5 | AcceptedCmp1 | AcceptedCmp2 | Complain | Response | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5524 | 1957 | Graduation | Single | 58138.0 | 0 | 0 | 04-09-2012 | 58 | 635 | ... | 10 | 4 | 7 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 1 | 2174 | 1954 | Graduation | Single | 46344.0 | 1 | 1 | 08-03-2014 | 38 | 11 | ... | 1 | 2 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 4141 | 1965 | Graduation | Together | 71613.0 | 0 | 0 | 21-08-2013 | 26 | 426 | ... | 2 | 10 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 6182 | 1984 | Graduation | Together | 26646.0 | 1 | 0 | 10-02-2014 | 26 | 11 | ... | 0 | 4 | 6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 5324 | 1981 | PhD | Married | 58293.0 | 1 | 0 | 19-01-2014 | 94 | 173 | ... | 3 | 6 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
5 rows × 27 columns
# View last 5 rows Hint: Use tail() method
data.tail()
| ID | Year_Birth | Education | Marital_Status | Income | Kidhome | Teenhome | Dt_Customer | Recency | MntWines | ... | NumCatalogPurchases | NumStorePurchases | NumWebVisitsMonth | AcceptedCmp3 | AcceptedCmp4 | AcceptedCmp5 | AcceptedCmp1 | AcceptedCmp2 | Complain | Response | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2235 | 10870 | 1967 | Graduation | Married | 61223.0 | 0 | 1 | 13-06-2013 | 46 | 709 | ... | 3 | 4 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2236 | 4001 | 1946 | PhD | Together | 64014.0 | 2 | 1 | 10-06-2014 | 56 | 406 | ... | 2 | 5 | 7 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 2237 | 7270 | 1981 | Graduation | Divorced | 56981.0 | 0 | 0 | 25-01-2014 | 91 | 908 | ... | 3 | 13 | 6 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 2238 | 8235 | 1956 | Master | Together | 69245.0 | 0 | 1 | 24-01-2014 | 8 | 428 | ... | 5 | 10 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2239 | 9405 | 1954 | PhD | Married | 52869.0 | 1 | 1 | 15-10-2012 | 40 | 84 | ... | 1 | 4 | 7 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
5 rows × 27 columns
Given the size of our dataset, we can make some initial observations by looking at the first 5 rows (the head) and the last 5 rows (the tail). We can see that the columns contain integers, objects, and floats. There are many binary indicator columns recording each customer's interactions with the various campaigns, as well as their purchase history and complaint record. This categorical data will be useful.
Customers have a unique id number
We are able to see their education level, marital status, income, and family size which will help us separate them into different groups.
We can see how long it has been since they have done any shopping and how much they have spent on select food groupings over the last 2 years.
There are 3 categories for how customers are making their purchases: through the website, catalog or in store.
There have been 5 advertising campaigns, but in these 10 rows there does not appear to have been much customer interaction with them, so there is room for improvement in this area. We can compare campaign acceptance with the use of the discounts that have been offered to examine the relationship.
# Check the datatypes of each column. Hint: Use info() method
data.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 2240 entries, 0 to 2239 Data columns (total 27 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 ID 2240 non-null int64 1 Year_Birth 2240 non-null int64 2 Education 2240 non-null object 3 Marital_Status 2240 non-null object 4 Income 2216 non-null float64 5 Kidhome 2240 non-null int64 6 Teenhome 2240 non-null int64 7 Dt_Customer 2240 non-null object 8 Recency 2240 non-null int64 9 MntWines 2240 non-null int64 10 MntFruits 2240 non-null int64 11 MntMeatProducts 2240 non-null int64 12 MntFishProducts 2240 non-null int64 13 MntSweetProducts 2240 non-null int64 14 MntGoldProds 2240 non-null int64 15 NumDealsPurchases 2240 non-null int64 16 NumWebPurchases 2240 non-null int64 17 NumCatalogPurchases 2240 non-null int64 18 NumStorePurchases 2240 non-null int64 19 NumWebVisitsMonth 2240 non-null int64 20 AcceptedCmp3 2240 non-null int64 21 AcceptedCmp4 2240 non-null int64 22 AcceptedCmp5 2240 non-null int64 23 AcceptedCmp1 2240 non-null int64 24 AcceptedCmp2 2240 non-null int64 25 Complain 2240 non-null int64 26 Response 2240 non-null int64 dtypes: float64(1), int64(23), object(3) memory usage: 472.6+ KB
# Check the number of missing values in each column of the data
data.isnull().sum()
ID 0 Year_Birth 0 Education 0 Marital_Status 0 Income 24 Kidhome 0 Teenhome 0 Dt_Customer 0 Recency 0 MntWines 0 MntFruits 0 MntMeatProducts 0 MntFishProducts 0 MntSweetProducts 0 MntGoldProds 0 NumDealsPurchases 0 NumWebPurchases 0 NumCatalogPurchases 0 NumStorePurchases 0 NumWebVisitsMonth 0 AcceptedCmp3 0 AcceptedCmp4 0 AcceptedCmp5 0 AcceptedCmp1 0 AcceptedCmp2 0 Complain 0 Response 0 dtype: int64
We can observe that ID has no null values, and the number of unique values is equal to the number of observations. So ID acts as an index for each data entry, and such a column would not provide any predictive power for our analysis. Hence, it can be dropped.
Income has 24 missing values
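The notebook removes extreme Income rows later but does not impute these 24 missing values. One common optional approach is median imputation, which is robust to the right skew seen in Income. A minimal sketch on a toy series (the values are illustrative, not from the dataset):

```python
import pandas as pd

# Toy Income-like series with missing values (illustrative only)
income = pd.Series([58138.0, 46344.0, None, 71613.0, None, 26646.0])

# Fill missing entries with the median, which is less sensitive to outliers than the mean
income_filled = income.fillna(income.median())

print(income_filled.isnull().sum())  # → 0
```

In the notebook this would be `data["Income"].fillna(data["Income"].median())`; dropping the 24 rows is an equally defensible choice given how few there are.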
Dropping the ID column
# Remove ID column from data. Hint: Use inplace = True
data.drop(columns = ['ID'], inplace=True)
# Explore basic summary statistics of numeric variables. Hint: Use describe() method.
data.describe().T
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| Year_Birth | 2240.0 | 1968.805804 | 11.984069 | 1893.0 | 1959.00 | 1970.0 | 1977.00 | 1996.0 |
| Income | 2216.0 | 52247.251354 | 25173.076661 | 1730.0 | 35303.00 | 51381.5 | 68522.00 | 666666.0 |
| Kidhome | 2240.0 | 0.444196 | 0.538398 | 0.0 | 0.00 | 0.0 | 1.00 | 2.0 |
| Teenhome | 2240.0 | 0.506250 | 0.544538 | 0.0 | 0.00 | 0.0 | 1.00 | 2.0 |
| Recency | 2240.0 | 49.109375 | 28.962453 | 0.0 | 24.00 | 49.0 | 74.00 | 99.0 |
| MntWines | 2240.0 | 303.935714 | 336.597393 | 0.0 | 23.75 | 173.5 | 504.25 | 1493.0 |
| MntFruits | 2240.0 | 26.302232 | 39.773434 | 0.0 | 1.00 | 8.0 | 33.00 | 199.0 |
| MntMeatProducts | 2240.0 | 166.950000 | 225.715373 | 0.0 | 16.00 | 67.0 | 232.00 | 1725.0 |
| MntFishProducts | 2240.0 | 37.525446 | 54.628979 | 0.0 | 3.00 | 12.0 | 50.00 | 259.0 |
| MntSweetProducts | 2240.0 | 27.062946 | 41.280498 | 0.0 | 1.00 | 8.0 | 33.00 | 263.0 |
| MntGoldProds | 2240.0 | 44.021875 | 52.167439 | 0.0 | 9.00 | 24.0 | 56.00 | 362.0 |
| NumDealsPurchases | 2240.0 | 2.325000 | 1.932238 | 0.0 | 1.00 | 2.0 | 3.00 | 15.0 |
| NumWebPurchases | 2240.0 | 4.084821 | 2.778714 | 0.0 | 2.00 | 4.0 | 6.00 | 27.0 |
| NumCatalogPurchases | 2240.0 | 2.662054 | 2.923101 | 0.0 | 0.00 | 2.0 | 4.00 | 28.0 |
| NumStorePurchases | 2240.0 | 5.790179 | 3.250958 | 0.0 | 3.00 | 5.0 | 8.00 | 13.0 |
| NumWebVisitsMonth | 2240.0 | 5.316518 | 2.426645 | 0.0 | 3.00 | 6.0 | 7.00 | 20.0 |
| AcceptedCmp3 | 2240.0 | 0.072768 | 0.259813 | 0.0 | 0.00 | 0.0 | 0.00 | 1.0 |
| AcceptedCmp4 | 2240.0 | 0.074554 | 0.262728 | 0.0 | 0.00 | 0.0 | 0.00 | 1.0 |
| AcceptedCmp5 | 2240.0 | 0.072768 | 0.259813 | 0.0 | 0.00 | 0.0 | 0.00 | 1.0 |
| AcceptedCmp1 | 2240.0 | 0.064286 | 0.245316 | 0.0 | 0.00 | 0.0 | 0.00 | 1.0 |
| AcceptedCmp2 | 2240.0 | 0.012946 | 0.113069 | 0.0 | 0.00 | 0.0 | 0.00 | 1.0 |
| Complain | 2240.0 | 0.009375 | 0.096391 | 0.0 | 0.00 | 0.0 | 0.00 | 1.0 |
| Response | 2240.0 | 0.149107 | 0.356274 | 0.0 | 0.00 | 0.0 | 0.00 | 1.0 |
There is a significant discrepancy between the third quartile and the maximum for Income, the number of catalog purchases, and the overall food purchases, particularly the fruit, meat, sweet, and gold products. This tells us that we may have outliers to the right.
Birth year has a very wide range from 1893-1996, almost 100 years. Average birth year is 1968.
Food purchase categories have a very broad range as well. The minimum purchase in two years for both fruit and sweets is only $1.
# List of the categorical columns in the data
cols = ["Education", "Marital_Status", "Kidhome", "Teenhome", "Complain"]
Number of unique observations in each category
for column in cols:
print("Unique values in", column, "are :")
print(data[column].value_counts())
print("*" * 50)
Unique values in Education are : Graduation 1127 PhD 486 Master 370 2n Cycle 203 Basic 54 Name: Education, dtype: int64 ************************************************** Unique values in Marital_Status are : Married 864 Together 580 Single 480 Divorced 232 Widow 77 Alone 3 Absurd 2 YOLO 2 Name: Marital_Status, dtype: int64 ************************************************** Unique values in Kidhome are : 0 1293 1 899 2 48 Name: Kidhome, dtype: int64 ************************************************** Unique values in Teenhome are : 0 1158 1 1030 2 52 Name: Teenhome, dtype: int64 ************************************************** Unique values in Complain are : 0 2219 1 21 Name: Complain, dtype: int64 **************************************************
There are 5 different categories under education
There are 8 different categories under marital status with Alone, Absurd and YOLO having less than 5 observations. The name of these categories and their low count should be a red flag as to their importance.
There are 3 categories in how many kids are in the home. They are 0, 1, and 2.
There are 3 categories in how many teenagers are in the home. They are 0, 1, and 2.
There are 2 categories for complaints, 1 for yes and 0 for no in the last two years.
Think About It:
# Replace the category "2n Cycle" with the category "Master"
data["Education"].replace("2n Cycle", "Master", inplace= True) # Hint: Use the replace() method and inplace=True
# Replace the categories "Alone", "Abusrd", "YOLO" with the category "Single"
data["Marital_Status"].replace("Alone", "Single", inplace= True) # Hint: Use the replace() method and inplace=True
data["Marital_Status"].replace("Absurd", "Single", inplace= True)
data["Marital_Status"].replace("YOLO", "Single", inplace= True)
# Checking to make sure that "Alone", "Absurd", and "YOLO" are no longer listed.
data['Marital_Status'].unique().tolist()
['Single', 'Together', 'Married', 'Divorced', 'Widow']
# Checking the count to make sure they are still included in the data set.
data['Marital_Status'].describe()
count 2240 unique 5 top Married freq 864 Name: Marital_Status, dtype: object
Univariate analysis is used to explore each variable in a data set, separately. It looks at the range of values, as well as the central tendency of the values. It can be done for both numerical and categorical variables.
Histograms help to visualize and describe numerical data. We can also use other plots like box plot to analyze the numerical columns.
# Create histogram for the Income feature
plt.figure(figsize=(15, 7))
sns.histplot(x='Income', data=data)
plt.show()
We can observe some extreme values on the right side of the distribution of the 'Income' feature. Let's use a box plot, as it is better suited to identifying extreme values in the data.
# Plot the boxplot
sns.boxplot(data=data, x='Income', showmeans=True, color="violet")
<AxesSubplot:xlabel='Income'>
Outliers are defined as observations that are highly inconsistent with the rest of the data set; the value lies far from the other observations. We can see some outliers here that should be taken care of.
Think About It
# Calculating the upper whisker for the Income variable
Q1 = data['Income'].quantile(0.25) # First quartile
Q3 = data['Income'].quantile(0.75) # Third quartile
IQR = Q3 - Q1 # Interquartile range
upper_whisker = Q3 + 1.5 * IQR # Upper whisker for the Income variable
print(upper_whisker) # Printing the upper whisker
118350.5
# Let's check the observations with extreme value for the Income variable
data[data.Income > upper_whisker]
| Year_Birth | Education | Marital_Status | Income | Kidhome | Teenhome | Dt_Customer | Recency | MntWines | MntFruits | ... | NumCatalogPurchases | NumStorePurchases | NumWebVisitsMonth | AcceptedCmp3 | AcceptedCmp4 | AcceptedCmp5 | AcceptedCmp1 | AcceptedCmp2 | Complain | Response | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 164 | 1973 | PhD | Married | 157243.0 | 0 | 1 | 01-03-2014 | 98 | 20 | 2 | ... | 22 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 617 | 1976 | PhD | Together | 162397.0 | 1 | 1 | 03-06-2013 | 31 | 85 | 1 | ... | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 655 | 1975 | Graduation | Divorced | 153924.0 | 0 | 0 | 07-02-2014 | 81 | 1 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 687 | 1982 | PhD | Married | 160803.0 | 0 | 0 | 04-08-2012 | 21 | 55 | 16 | ... | 28 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1300 | 1971 | Master | Together | 157733.0 | 1 | 0 | 04-06-2013 | 37 | 39 | 1 | ... | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1653 | 1977 | Graduation | Together | 157146.0 | 0 | 0 | 29-04-2013 | 13 | 1 | 0 | ... | 28 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2132 | 1949 | PhD | Married | 156924.0 | 0 | 0 | 29-08-2013 | 85 | 2 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2233 | 1977 | Graduation | Together | 666666.0 | 1 | 0 | 02-06-2013 | 23 | 9 | 14 | ... | 1 | 3 | 6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
8 rows × 26 columns
Think About It:
# Check the 99.5% percentile value for the Income variable
data.quantile(q=.995)['Income']
102145.75000000003
A value that falls more than 3 standard deviations from the mean (roughly 99.7% of a normal distribution lies within that range) is still part of the distribution, but it is an unlikely or rare event. For Income, anything above 102,145.75 is above the 99.5th percentile and can also be considered an outlier.
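Dropping rows is one option; an alternative that keeps the rows is to cap (winsorize) extreme values at a high percentile. A minimal sketch on a toy Income-like series (the values are illustrative, mimicking the 666,666 entry):

```python
import pandas as pd

# Toy income column with one extreme value (illustrative only)
income = pd.Series([30000.0, 45000.0, 52000.0, 68000.0, 666666.0])

# Cap values above the 99.5th percentile instead of removing the rows
cap = income.quantile(0.995)
income_capped = income.clip(upper=cap)

print(income_capped.max() < income.max())  # → True: the extreme value was pulled in
```

Whether to drop or cap depends on whether those customers should still influence the clusters; this notebook drops them.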
# Dropping observations identified as outliers
data.drop(data[data['Income'] >= 153924.0].index, inplace = True) # Drop the rows whose Income is at or above the lowest extreme value identified above
Now, let's check the distribution of the Income variable after dropping outliers.
# Plot the histogram for 'Income'
plt.figure(figsize=(15, 7))
sns.histplot(x='Income', data=data)
plt.title('Distribution of Income')
plt.xlabel('Income')
plt.ylabel('Count')
Text(0, 0.5, 'Count')
# Plot the histogram for 'MntWines'
plt.figure(figsize=(15, 7))
sns.histplot(x='MntWines', data=data)
plt.title('Distribution of Mnt Wines Purchases')
plt.xlabel('Dollar amount spent')
plt.ylabel('Count')
plt.show()
# Plot the histogram for 'MntFruits'
plt.figure(figsize=(15, 7))
sns.histplot(x='MntFruits', data=data)
plt.title('Distribution of Mnt Fruits Purchases')
plt.xlabel('Dollar amount spent')
plt.ylabel('Count')
plt.show()
# Plot the histogram for 'MntMeatProducts'
plt.figure(figsize=(15, 7))
sns.histplot(x='MntMeatProducts', data=data)
plt.title('Distribution of Mnt Meat Products Purchases')
plt.xlabel('Dollar amount spent')
plt.ylabel('Count')
plt.show()
# Plot the histogram for 'MntFishProducts'
plt.figure(figsize=(15, 7))
sns.histplot(x='MntFishProducts', data=data)
plt.title('Distribution of Mnt Fish Products Purchases')
plt.xlabel('Dollar amount spent')
plt.ylabel('Count')
plt.show()
# Plot the histogram for 'MntSweetProducts'
plt.figure(figsize=(15, 7))
sns.histplot(x='MntSweetProducts', data=data)
plt.title('Distribution of Mnt Sweet Products Purchases')
plt.xlabel('Dollar amount spent')
plt.ylabel('Count')
plt.show()
# Plot the histogram for 'MntGoldProds'
plt.figure(figsize=(15, 7))
sns.histplot(x='MntGoldProds', data=data)
plt.title('Distribution of Mnt Gold Products Purchases')
plt.xlabel('Dollar amount spent')
plt.ylabel('Count')
plt.show()
Income: With the outliers removed, the distribution now looks closer to a normal distribution, with a bell-curve shape.
MntWines: Most of the purchases are under $50. Our data is right skewed. This is one of the items with the most amount spent.
MntFruits: Most of the purchases are under $10. Our data is right skewed. This is the item with the lowest total amount spent.
MntMeat: Most of the purchases are under $75. Our data is right skewed with some outliers. This is one of the items with the most amount spent.
MntFish: Most of the purchases are under $15. Our data is right skewed.
MntSweets: Most of the purchases are under $10. Our data is right skewed with some outliers.
MntGold: Most of the purchases are under $25. Our data is right skewed with some outliers.
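The right skew described above can also be quantified with pandas' `skew()` method. A sketch on a toy frame (on the real data you would pass the `Mnt` columns of `data`; the toy values are illustrative):

```python
import pandas as pd

# Toy spending columns; on the real data: data[['MntWines', 'MntFruits']].skew()
toy = pd.DataFrame({
    "MntWines": [5, 10, 20, 40, 800],
    "MntFruits": [0, 1, 2, 3, 150],
})

# Positive skew values confirm the right skew visible in the histograms
print(toy.skew())
```

As a rule of thumb, skewness above about 1 indicates substantial right skew, which is worth keeping in mind before applying distance-based clustering.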
Let us write a function that will help us create bar plots that indicate the percentage for each category. This function takes the categorical column as the input and returns the bar plot for the variable.
def perc_on_bar(z):
    '''
    Plot a bar chart of a categorical column with the percentage of each category annotated.
    z: name of the categorical column
    Note: the function won't work if a column is passed to the hue parameter.
    '''
    total = len(data[z]) # Number of observations in the column
    plt.figure(figsize=(15, 5))
    ax = sns.countplot(x=z, data=data, palette='Paired', order=data[z].value_counts().index)
    for p in ax.patches:
        percentage = '{:.1f}%'.format(100 * p.get_height() / total) # Percentage of each class of the category
        x = p.get_x() + p.get_width() / 2 - 0.05 # x position of the annotation
        y = p.get_y() + p.get_height() # y position of the annotation
        ax.annotate(percentage, (x, y), size=12) # Annotate the percentage
    plt.show() # Show the plot
# Bar plot for 'Marital_Status'
perc_on_bar('Marital_Status')
perc_on_bar('Education')
perc_on_bar('Kidhome')
perc_on_bar('Teenhome')
perc_on_bar('Complain')
Marriage:
38.6% of the individuals are married, 25.9% are together, 21.7% are single, 10.4% are divorced, and 3.4% are widowed.
Education:
50.3% have achieved graduation, 25.6% have their masters, 21.7% have their PhD, 2.4% are under the category of basic.
Kidhome:
57.7% of the individuals have 0 kids in their house, 40.1% have 1 kid in their house, 2.1% have 2 kids in their house.
Teenhome:
51.7% of the individuals have 0 teens in their house, 46.0% have 1 teen in their house, 2.3% have 2 teens in their house.
Complain:
99.1% have not made a complaint and 0.9% have made a complaint.
We have analyzed different categorical and numerical variables. Now, let's check how different variables are related to each other.
Heat map can show a 2D correlation matrix between numerical features.
#heatmap for all the features.
plt.figure(figsize=(15, 7)) # Setting the plot size
sns.heatmap(data.corr(), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral") # Plotting the correlation plot
plt.show()
Observations
Kidhome has strong negative correlation with Income, all the food categories and all the purchasing options. Teenhome also shows negative correlation with these variables but not as much as Kidhome.
#Heatmap for Numerical features.
num_var = ['Income', 'Recency', 'MntWines', 'MntFruits', 'MntMeatProducts', 'MntFishProducts', 'MntSweetProducts', 'MntGoldProds', 'NumDealsPurchases', 'NumWebPurchases',
'NumCatalogPurchases', 'NumStorePurchases', 'NumWebVisitsMonth', 'AcceptedCmp1', 'AcceptedCmp2', 'AcceptedCmp3',
'AcceptedCmp4', 'AcceptedCmp5']
corr = data[num_var].corr()
plt.figure(figsize=(20, 10)) # Setting the plot size
sns.heatmap(corr, annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral") # Plotting the correlation plot
plt.show()
Income is positively correlated with MntWines (0.73), MntMeatProducts (0.70), NumCatalogPurchases (0.71), and NumStorePurchases (0.69). Income has a strong negative correlation with NumWebVisitsMonth (-0.65), suggesting that higher-income customers spend less time on the website and make more of their purchases through the catalog or store. Income is positively correlated with all the food categories, and among the campaigns it correlates most with campaign 5.
MntWines is positively correlated with MntMeatPurchases (0.59), NumWebPurchases (0.54), NumCatalogPurchases (0.67) and NumStorePurchases (0.64).
MntMeatProducts is positively correlated with all of the other food categories and has high correlation with NumCatalogPurchases (0.70).
Of all the food categories MntGoldProducts has the lowest correlation with anything else.
NumDealsPurchases is negatively correlated with all the food categories except MntWines and MntMeatProducts, and even those correlations are very low. Deals show a promising positive correlation only with NumWebVisitsMonth.
NumWebVisitsMonth has the most negative correlations across the board, the strongest being with Income (-0.65), followed by MntMeatProducts (-0.54), NumCatalogPurchases (-0.53), and then the rest of the food categories. It is also negatively correlated with AcceptedCmp1 (-0.20) and AcceptedCmp5 (-0.28).
AcceptedCmp1 is most positively correlated with AcceptedCmp5 (0.40).
AcceptedCmp2 is most positively correlated with AcceptedCmp4 (0.30).
AcceptedCmp3 is most positively correlated with AcceptedCmp1 (0.09) and is the only campaign negatively correlated with another campaign, in this case AcceptedCmp4 (-0.08).
Campaign correlation is key when examining our clusters so that we can make the appropriate recommendations to the marketing department.
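One way to examine this is a campaign-only correlation matrix. A sketch on toy 0/1 acceptance columns (the real columns are AcceptedCmp1 through AcceptedCmp5 and Response in `data`; the toy values are illustrative):

```python
import pandas as pd

# Toy 0/1 campaign-acceptance columns (illustrative only)
camp = pd.DataFrame({
    "AcceptedCmp1": [0, 0, 1, 1, 0, 1],
    "AcceptedCmp5": [0, 0, 1, 1, 0, 0],
    "Response":     [1, 0, 1, 0, 0, 1],
})

# A campaign-only correlation matrix isolates which campaigns attract the same customers
camp_corr = camp.corr().round(2)
print(camp_corr)
```

On the real data this would be `data[camp_cols].corr()`, which makes the campaign-to-campaign relationships easier to read than the full heatmap.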
The above correlation heatmap only shows the relationship between numerical variables. Let's check the relationship of numerical variables with categorical variables.
sns.barplot(x='Education', y='Income', data=data)
plt.show()
Customers with a PhD have the highest income, while those with basic education have the lowest. About half the customers (50.3%) have completed Graduation; the rest are nearly evenly split between a Master's (25.6%) and a PhD (21.7%), with the remaining 2.4% at Basic. This tells us that slightly less than a quarter of the customers fall in the highest-income group.
# Plot the bar plot for Marital_Status and Income
sns.barplot(x='Marital_Status', y='Income', data=data)
plt.show()
Marital status does not have a great impact on income, though widows' income is slightly higher than that of the other categories. Widows also make up only 3.4% of the customers.
# Plot the bar plot for Kidhome and Income
sns.barplot(x='Kidhome', y='Income', data=data)
plt.show()
Customers with no children have a significantly higher income than those who do have children.
We can also visualize the relationship between two categorical variables.
# Plot the bar plot for Marital_Status and Kidhome
pd.crosstab(data['Marital_Status'], data['Kidhome']).plot(kind='bar',stacked=False)
<AxesSubplot:xlabel='Marital_Status'>
Overall, we can observe that married individuals have more children in total than any other group: they have the highest count for every Kidhome value (0, 1, and 2). This makes sense because married people make up 38.6% of the data, far more than a fifth.
Single and Together have relatively close counts for having 1 child in the home, with Together slightly higher. The Together group makes up 25.9% of the data and the Single group 21.7%.
There are no widows who have 2 children. Most have 0, and a very small proportion have 1.
Divorced individuals with children are also on the low side; this group makes up only 10.4% of the data.
In this section, we will first prepare our dataset for analysis.
Think About It:
# Subtracting the birth year from 2016 gives us the age of the customer at the time of data collection in 2016
data["Age"] = 2016 - data["Year_Birth"]
# Sorting the values in ascending order
data["Age"].sort_values()
1170 20
46 20
696 21
747 21
1850 21
...
424 75
1950 76
192 116
339 117
239 123
Name: Age, Length: 2232, dtype: int64
Basing age on birth year, the youngest individual in our data is 20 and the oldest is 123. Three individuals are over the age of 115, which looks like a data-entry mistake, or possibly these were customers at the time the data was collected who are no longer living.
Think About It:
# Drop the observations with age > 115
# Hint: Use drop() method with inplace=True
data.drop(data[data["Age"] > 115].index, inplace=True)
data["Age"].sort_values()
1170 20
46 20
2213 21
696 21
995 21
..
39 73
358 73
894 73
424 75
1950 76
Name: Age, Length: 2229, dtype: int64
Here we can see that the ages that were greater than 115 have now been dropped.
Now, let's check the distribution of age in the data.
# Plot histogram to check the distribution of age
plt.figure(figsize=(15, 7))
sns.histplot(x='Age', data=data)
plt.show()
The distribution of age is roughly bimodal, peaking at around 40 and 45 with a smaller rise at 60. With more data this would probably smooth into a bell curve or a cleaner bimodal shape.
Our highest count is at 45 years old.
We are not seeing skewness to the right or left.
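As an optional step, age can also be bucketed into groups for profiling clusters later. A sketch using `pd.cut`; the bin edges and labels here are illustrative choices, not prescribed by the notebook:

```python
import pandas as pd

# Toy ages; on the real data you would pass data["Age"]
ages = pd.Series([22, 35, 41, 44, 58, 63, 76])

# Bin ages into ranges (edges and labels are illustrative)
age_group = pd.cut(ages, bins=[0, 30, 45, 60, 120],
                   labels=["<30", "30-45", "45-60", "60+"])
print(age_group.value_counts())
```

Grouped ages can make cluster profiles easier to describe, though the raw Age column is what feeds the clustering itself.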
# Add Kidhome and Teenhome variables to create the new feature called "Kids"
data["Kids"] = data['Kidhome'] + data['Teenhome']
perc_on_bar('Kids')
Observations:
By combining Kidhome and Teenhome into the new "Kids" feature, we have created a new option of 3 children in the home, instead of the previous maximum of 2.
# Check the unique categories in Marial_Status
data['Marital_Status'].unique().tolist()
['Single', 'Together', 'Married', 'Divorced', 'Widow']
# Replace "Married" and "Together" with "Relationship"
data['Marital_Status'].replace(["Married", "Together"], "Relationship", inplace=True)
# Replace "Divorced" and "Widow" with "Single"
data['Marital_Status'].replace(["Divorced", "Widow"], "Single", inplace=True)
#checking our new sub-categories
data['Marital_Status'].unique().tolist()
['Single', 'Relationship']
# Create a new feature called "Status" by replacing "Single" with 1 and "Relationship" with 2 in Marital_Status
data["Status"] = data['Marital_Status'].replace({"Single": 1, "Relationship": 2})
# Add two variables Status and Kids to get the total number of persons in each family
data["Family_Size"] = data["Status"] + data["Kids"]
perc_on_bar('Family_Size')
data['Family_Size'].describe()
count 2229.000000 mean 2.596231 std 0.907432 min 1.000000 25% 2.000000 50% 3.000000 75% 3.000000 max 5.000000 Name: Family_Size, dtype: float64
plt.figure(figsize=(5, 5))
sns.histplot(x="Family_Size", data=data, bins=5)
<AxesSubplot:xlabel='Family_Size', ylabel='Count'>
sns.barplot(x='Status', y='Income', data=data)
plt.show()
# Create a new feature
# Add the amount spent on the food products 'MntWines', 'MntFruits', 'MntMeatProducts', 'MntFishProducts', 'MntSweetProducts'
# (MntGoldProds is left out of this total)
data["Expenses"] = data['MntWines'] + data['MntFruits'] + data['MntMeatProducts'] + data['MntFishProducts'] + data['MntSweetProducts']
data["Expenses"].describe()
count 2229.000000 mean 561.427995 std 575.375816 min 4.000000 25% 56.000000 50% 342.000000 75% 962.000000 max 2491.000000 Name: Expenses, dtype: float64
# A scatter plot suits two continuous variables better than a bar plot
sns.scatterplot(x='Expenses', y='Income', data=data)
plt.show()
# Create a new feature
# Add the number of purchases from each channel 'NumDealsPurchases', 'NumWebPurchases', 'NumCatalogPurchases', 'NumStorePurchases'
data["NumTotalPurchases"] = data['NumDealsPurchases'] + data['NumWebPurchases'] + data['NumCatalogPurchases'] + data['NumStorePurchases']
data["NumTotalPurchases"].describe()
count 2229.000000 mean 14.869000 std 7.622187 min 0.000000 25% 8.000000 50% 15.000000 75% 21.000000 max 43.000000 Name: NumTotalPurchases, dtype: float64
# Converting the Dt_Customer variable to a Python datetime object
data["Dt_Customer"] = pd.to_datetime(data['Dt_Customer'])
Let's check the max and min of the date.
# Check the minimum of the date
# Hint: Use the min() method
data["Dt_Customer"].min()
Timestamp('2012-01-08 00:00:00')
# Check the maximum of the date
# Hint: Use the max() method
data["Dt_Customer"].max()
Timestamp('2014-12-06 00:00:00')
Think About It:
# Assigning a reference date (shortly after the latest Dt_Customer, 2014-12-06) to the day variable
data["day"] = "01-01-2015"
# Converting the variable day to Python datetime object
data["day"] = pd.to_datetime(data.day)
data["Engaged_in_days"] = (data["day"] - data["Dt_Customer"]).dt.days
# Add all the campaign related variables to get the total number of accepted campaigns by a customer
# "AcceptedCmp1", "AcceptedCmp2", "AcceptedCmp3", "AcceptedCmp4", "AcceptedCmp5", "Response"
data["TotalAcceptedCmp"] = data["AcceptedCmp1"] + data["AcceptedCmp2"] + data["AcceptedCmp3"] + data["AcceptedCmp4"] + data["AcceptedCmp5"] + data["Response"]
data["TotalAcceptedCmp"].describe()
count 2229.000000 mean 0.448183 std 0.890118 min 0.000000 25% 0.000000 50% 0.000000 75% 1.000000 max 5.000000 Name: TotalAcceptedCmp, dtype: float64
print(sns.barplot(x='TotalAcceptedCmp', y='Income', data=data))
AxesSubplot(0.125,0.11;0.775x0.77)
# Plot the histogram of 'TotalAcceptedCmp'
plt.figure(figsize=(15, 7))
sns.histplot(x='TotalAcceptedCmp', data=data)
plt.title('Distribution of Total Accepted Campaigns')
plt.xlabel('Campaigns Accepted')
plt.ylabel('Count')
Text(0, 0.5, 'Count')
sns.boxplot(data=data, x='TotalAcceptedCmp', showmeans=True, color="violet")
<AxesSubplot:xlabel='TotalAcceptedCmp'>
# Divide the "Expenses" by "NumTotalPurchases" to create the new feature AmountPerPurchase
data["AmountPerPurchase"] = data["Expenses"] / data["NumTotalPurchases"]
data["AmountPerPurchase"].describe()
count 2229.000000 mean inf std NaN min 0.466667 25% 7.800000 50% 20.764706 75% 41.529412 max inf Name: AmountPerPurchase, dtype: float64
Now, let's check the maximum value of the AmountPerPurchase.
# Check the max value
# Hint: Use max() function
data['AmountPerPurchase'].max()
inf
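The inf values come from floating-point division semantics: in pandas (as in NumPy), dividing a nonzero amount by zero yields inf rather than raising an error, so any customer with NumTotalPurchases equal to 0 produces an infinite AmountPerPurchase. A minimal standalone sketch with toy numbers:

```python
import numpy as np
import pandas as pd

# Toy stand-ins for Expenses and NumTotalPurchases
expenses = pd.Series([100, 50])
purchases = pd.Series([4, 0])

ratio = expenses / purchases  # division by zero yields inf, not an error
print(ratio.tolist())         # [25.0, inf]

# Count how many rows were affected before deciding to drop or impute them
print(int(np.isinf(ratio).sum()))  # 1
```

Counting the affected rows first (as the next cells do) lets you confirm that dropping them is harmless before modifying the data.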
Think About It:
# Find how many observations have NumTotalPurchases equal to 0
data['NumTotalPurchases'].value_counts()[0]
2
# Get the indices of the observations with NumTotalPurchases equal to 0
data[data['NumTotalPurchases'] == 0].index
Int64Index([981, 1524], dtype='int64')
# Drop the observations with NumTotalPurchases equal to 0, using their indices
# This general form works regardless of which rows are affected
data.drop(data[data['NumTotalPurchases'] == 0].index, inplace=True)
data['NumTotalPurchases'].describe()
count 2227.000000 mean 14.882353 std 7.612563 min 1.000000 25% 8.000000 50% 15.000000 75% 21.000000 max 43.000000 Name: NumTotalPurchases, dtype: float64
Now, let's check the distribution of values in AmountPerPurchase column.
# Check the summary statistics of the AmountPerPurchase variable
data['AmountPerPurchase'].describe()
count 2227.000000 mean 30.540586 std 44.017468 min 0.466667 25% 7.800000 50% 20.736842 75% 41.325630 max 1657.000000 Name: AmountPerPurchase, dtype: float64
# Plot the histogram for the AmountPerPurchase variable
plt.figure(figsize=(15, 7))
sns.histplot(x="AmountPerPurchase", data=data)
plt.title('Distribution of Amount Per Purchase')
plt.xlabel('Dollar Amount')
plt.ylabel('Count')
plt.show()
We have one outlier that is extremely high. The majority of the purchases have a lower dollar amount.
# Impute the missing values for the Income variable with the median
data['Income'] = data['Income'].fillna(data['Income'].median()) # Median is robust to the outliers seen in Income
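The cell above fills every missing Income with the single overall median. An optional refinement (a sketch, not the notebook's approach) is a group-wise median, e.g. per Education level, which can better reflect each customer's likely income; the column names below mirror the dataset but the values are toy data:

```python
import pandas as pd

df = pd.DataFrame({
    "Education": ["PhD", "PhD", "Basic", "Basic"],
    "Income":    [80000, None, 30000, None],
})

# Fill each missing Income with the median of that row's Education group
df["Income"] = df.groupby("Education")["Income"].transform(lambda s: s.fillna(s.median()))
print(df["Income"].tolist())  # [80000.0, 80000.0, 30000.0, 30000.0]
```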
Now that we are done with data preprocessing, let's visualize new features against the new income variable we have after imputing missing values.
# Plot the scatter plot with Expenses on Y-axis and Income on X-axis
x = data['Income']
y = data['Expenses']
plt.figure(figsize=(20, 10)) # Setting the plot size
sns.scatterplot(x=x, y=y) # Hint: Use sns.scatterplot() with keyword arguments (positional data args are deprecated)
plt.xticks(fontsize=16) # Font size of X-label
plt.yticks(fontsize=16) # Font size of Y-label
plt.xlabel("Income", fontsize=20, labelpad=20) # Title of X-axis
plt.ylabel("Expenses", fontsize=20, labelpad=20) # Title of Y-axis
Text(0, 0.5, 'Expenses')
We can see a positive, exponential-looking relationship between income and expenses; a linear model would not fit this data well.
# Plot the bar plot for Family Size on X-axis and Income on Y-axis
plt.figure(figsize = (15,5))
sns.barplot(data=data, x='Family_Size', y='Income', ci=None) # ci=None disables the bootstrapped confidence interval
plt.show()
Customers whose family size is 1 have the highest income, followed by those with a family size of 2; in all likelihood these are single adults and couples with no children.
The lowest incomes belong to families of 4, followed closely by families of 3 and then families of 5.
The most significant factor separating these groups appears to be whether or not they have children.
What are the most important observations and insights from the data based on the EDA and Data Preprocessing performed?
The decision about which variables to use for clustering is a critically important decision that will have a big impact on the clustering solution. So we need to think carefully about the variables we will choose for clustering. Clearly, this is a step where a lot of contextual knowledge, creativity, and experimentation/iterations are needed.
Moreover, we often use only a few of the data attributes for segmentation (the segmentation attributes) and use some of the remaining ones (the profiling attributes) only to profile the clusters. For example, in market research and market segmentation, we can use behavioral data for segmentation (to segment the customers based on their behavior like amount spent, units bought, etc.), and then use both demographic as well as behavioral data for profiling the segments found.
Here, we will use the behavioral attributes for segmentation and drop the demographic attributes like Income, Age, and Family_Size. In addition to this, we need to drop some other columns which are mentioned below.
Dt_Customer: We have created the Engaged_in_days variable from Dt_Customer, so this variable will not help with segmentation and can be dropped.
day: This variable was only used to create Engaged_in_days, so it can be dropped as well.
Complain: About 95% of the customers didn't complain and share the same value for this column, so it will not have a major impact on segmentation.
Status: This column was created only to derive the Family_Size variable, which already contains its information.
Education, Marital_Status, Kids, Kidhome, and Teenhome: distance-based algorithms cannot use a default distance like Euclidean across a mix of categorical and numerical variables.
AcceptedCmp1, AcceptedCmp2, AcceptedCmp3, AcceptedCmp4, AcceptedCmp5, and Response: we have created the TotalAcceptedCmp variable as the aggregate of all these variables.
# Dropping all the irrelevant columns and storing the result in data_model
data_model = data.drop(
columns=[
"Year_Birth",
"Dt_Customer",
"day",
"Complain",
"Response",
"AcceptedCmp1",
"AcceptedCmp2",
"AcceptedCmp3",
"AcceptedCmp4",
"AcceptedCmp5",
"Marital_Status",
"Status",
"Kids",
"Education",
"Kidhome",
"Teenhome",
"Income",
"Age",
"Family_Size",
],
axis=1,
)
# Check the shape of new data
data_model.shape
(2227, 17)
# Check first five rows of new data
data_model.head()
| Recency | MntWines | MntFruits | MntMeatProducts | MntFishProducts | MntSweetProducts | MntGoldProds | NumDealsPurchases | NumWebPurchases | NumCatalogPurchases | NumStorePurchases | NumWebVisitsMonth | Expenses | NumTotalPurchases | Engaged_in_days | TotalAcceptedCmp | AmountPerPurchase | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 58 | 635 | 88 | 546 | 172 | 88 | 88 | 3 | 8 | 10 | 4 | 7 | 1529 | 25 | 997 | 1 | 61.160000 |
| 1 | 38 | 11 | 1 | 6 | 2 | 1 | 6 | 2 | 1 | 1 | 2 | 5 | 21 | 6 | 151 | 0 | 3.500000 |
| 2 | 26 | 426 | 49 | 127 | 111 | 21 | 42 | 1 | 8 | 2 | 10 | 4 | 734 | 21 | 498 | 0 | 34.952381 |
| 3 | 26 | 11 | 4 | 20 | 10 | 3 | 5 | 2 | 2 | 0 | 4 | 6 | 48 | 8 | 91 | 0 | 6.000000 |
| 4 | 94 | 173 | 43 | 118 | 46 | 27 | 15 | 5 | 5 | 3 | 6 | 5 | 407 | 19 | 347 | 0 | 21.421053 |
Let's plot the correlation plot after we've removed the irrelevant variables.
# Plot the correlation plot for new data
plt.figure(figsize = (15, 15))
sns.heatmap(data_model.corr(), annot = True, fmt = ".2f")
plt.show()
Observations and Insights:
We observe a single very high positive correlation: Expenses and MntWines at 0.90.
There is good correlation between NumTotalPurchases and MntWines (0.72), NumWebPurchases (0.79), NumCatalogPurchases (0.74), NumStorePurchases (0.83), and Expenses (0.74). Customers with many total purchases are buying across all the different channels, with in-store purchases showing the strongest link at 0.83.
Expenses also correlates well with MntMeatProducts (0.86), NumCatalogPurchases (0.79), and NumTotalPurchases (0.74).
There is good correlation between MntMeatProducts and NumCatalogPurchases (0.70).
Weak positive correlations exist between many pairs, e.g. AmountPerPurchase with MntWines and MntMeatProducts, and NumWebPurchases with MntWines.
With the advertising campaigns in mind, the highest correlation for TotalAcceptedCmp (still weak) is with MntWines (0.49); there are also weak correlations with MntMeatProducts (0.34), NumCatalogPurchases (0.37), and Expenses (0.46).
The highest correlation for NumDealsPurchases is with NumWebVisitsMonth (0.37), which is interesting from a marketing perspective even though the correlation is weak.
NumWebVisitsMonth shows a good negative correlation with MntMeatProducts (-0.54), NumCatalogPurchases, and Expenses, and weak negative correlations with MntWines, MntFruits, MntFishProducts, MntSweetProducts, NumStorePurchases, NumTotalPurchases, and AmountPerPurchase.
The correlation matrix shows that high combined expenses are closely tied to wine and meat purchases, which makes sense because, of everything being sold, these two sub-categories are the most costly. Wine purchases correlate with every other category of interest across the board: people who buy wine shop more often, visit the website more, spend more, accept more campaigns, and are more likely than buyers of any other food category to make a purchase with a deal (although this last relationship is still very weak).
Higher total purchase counts are fairly evenly distributed between website, catalog, and store, with in-store being the highest.
Interestingly, people who visit the website more often in a month are less likely to purchase anything: they have fewer total purchases and store purchases and spend less per purchase in every food category. These people may be spending their time looking for the best deal, as NumWebVisitsMonth correlates negatively with every category except NumDealsPurchases and Engaged_in_days, where it has its highest correlations. This suggests that the longest-standing customers check the website regularly, waiting for the best deals.
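Observations like the ones above can be made systematic by ranking the absolute pairwise correlations programmatically. A sketch on synthetic data (the same recipe applies directly to data_model.corr()):

```python
import numpy as np
import pandas as pd

# Synthetic frame: column "d" is constructed to correlate strongly with "a"
rng = np.random.RandomState(1)
df = pd.DataFrame(rng.normal(size=(100, 3)), columns=["a", "b", "c"])
df["d"] = df["a"] * 2 + rng.normal(scale=0.1, size=100)

corr = df.corr().abs()
# Keep only the upper triangle (each pair once, diagonal excluded), then rank
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack().sort_values(ascending=False)
print(pairs.head(3))  # ("a", "d") tops the list with |r| close to 1
```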
What is feature scaling?
Feature scaling is a family of statistical techniques that, as the name implies, scale the features of the data so that they all have a similar range. An example makes this clearer:
If you have multiple independent variables like Age, Income, and Amount related variables, with their range as (18–100 Years), (25K–75K), and (100–200), respectively, feature scaling would help them all to be in the same range.
Why is feature scaling important in Unsupervised Learning?
Feature scaling is especially relevant in machine learning models that compute some sort of distance metric as we do in most clustering algorithms, for example, K-Means.
So, scaling should be done to avoid the problem of one feature dominating over others because the unsupervised learning algorithm uses distance to find the similarity between data points.
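To see the dominance problem concretely, here is a minimal sketch with two toy customers whose incomes differ by 1000 and ages by 30: before scaling, the Euclidean distance is driven almost entirely by the income column.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two toy customers: [Income, Age]
X = np.array([[25000.0, 20.0],
              [26000.0, 50.0]])

raw_dist = np.linalg.norm(X[0] - X[1])  # dominated by the income difference
print(round(raw_dist, 1))  # 1000.4

X_scaled = StandardScaler().fit_transform(X)
scaled_dist = np.linalg.norm(X_scaled[0] - X_scaled[1])
print(round(scaled_dist, 2))  # 2.83 -- both features now contribute equally
```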
Let's scale the data
Standard Scaler: StandardScaler standardizes a feature by subtracting the mean and then scaling to unit variance. Unit variance means dividing all the values by the standard deviation.
# Applying standard scaler on new data
scaler = StandardScaler() # Initialize the Standard Scaler
df_scaled = scaler.fit_transform(data_model) # fit_transform the scaler function on new data
df_scaled = pd.DataFrame(df_scaled, columns=data_model.columns) # Converting the embeddings to a dataframe
df_scaled.head()
| Recency | MntWines | MntFruits | MntMeatProducts | MntFishProducts | MntSweetProducts | MntGoldProds | NumDealsPurchases | NumWebPurchases | NumCatalogPurchases | NumStorePurchases | NumWebVisitsMonth | Expenses | NumTotalPurchases | Engaged_in_days | TotalAcceptedCmp | AmountPerPurchase | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.306906 | 0.979274 | 1.549793 | 1.735137 | 2.455586 | 1.471064 | 0.841828 | 0.357919 | 1.404892 | 2.633008 | -0.561330 | 0.696875 | 1.681096 | 1.329371 | 1.975678 | 0.619416 | 0.695776 |
| 1 | -0.384035 | -0.873925 | -0.638021 | -0.726901 | -0.652191 | -0.633425 | -0.732464 | -0.169952 | -1.119121 | -0.586255 | -1.178960 | -0.135935 | -0.940315 | -1.167064 | -1.667464 | -0.503906 | -0.614453 |
| 2 | -0.798600 | 0.358572 | 0.569049 | -0.175222 | 1.340442 | -0.149634 | -0.041311 | -0.697824 | 1.404892 | -0.228559 | 1.291559 | -0.552339 | 0.299119 | 0.803806 | -0.173173 | -0.503906 | 0.100251 |
| 3 | -0.798600 | -0.873925 | -0.562579 | -0.663070 | -0.505942 | -0.585045 | -0.751662 | -0.169952 | -0.758548 | -0.943951 | -0.561330 | 0.280470 | -0.893380 | -0.904281 | -1.925843 | -0.503906 | -0.557644 |
| 4 | 1.550599 | -0.392806 | 0.418165 | -0.216256 | 0.152175 | -0.004497 | -0.559676 | 1.413662 | 0.323172 | 0.129137 | 0.056299 | -0.135935 | -0.269317 | 0.541023 | -0.823427 | -0.503906 | -0.207226 |
# Fitting T-SNE with number of components equal to 2 to visualize how the data is distributed
tsne = TSNE(n_components=2, random_state=1, perplexity=35) # Initializing T-SNE with n_components=2, random_state=1, and perplexity=35
data_tsne = tsne.fit_transform(df_scaled) # fit_transform T-SNE on the scaled data (not the unscaled data_model)
data_tsne = pd.DataFrame(data_tsne, columns=[0, 1]) # Converting the embeddings to a dataframe
plt.figure(figsize=(7, 7)) # Scatter plot for the two components
sns.scatterplot(x=data_tsne.iloc[:, 0], y=data_tsne.iloc[:, 1], data=data_tsne) # Plotting T-SNE
<AxesSubplot:xlabel='0', ylabel='1'>
Observation and Insights:
T-SNE with just 2 components does not reveal clear clusters here. High multicollinearity between the variables may be obscuring the structure, and it would also affect a distance-based clustering method applied directly to this data.
Think about it:
When the variables used in clustering are highly correlated, it causes multicollinearity, which affects the clustering method and results in poor cluster profiling (or biased toward a few variables). PCA can be used to reduce the multicollinearity between the variables.
# Defining the number of principal components to generate
n = data_model.shape[1] # Storing the number of variables in the data
pca = PCA(n_components = n, random_state = 1) # Initialize PCA with n_components = n and random_state=1
data_pca = pd.DataFrame(pca.fit_transform(df_scaled)) # fit_transform PCA on the scaled data
# The percentage of variance explained by each principal component is stored
exp_var = pca.explained_variance_ratio_
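exp_var can also be used to choose a reduced number of components: keep the smallest count whose cumulative explained-variance ratio crosses a chosen threshold (90% here, a common but arbitrary choice). A sketch with illustrative ratios, not the actual values from this data:

```python
import numpy as np

# Illustrative per-component explained-variance ratios
exp_var_demo = np.array([0.5, 0.25, 0.125, 0.0625, 0.03125, 0.03125])
cum_var = np.cumsum(exp_var_demo)

# Smallest number of components explaining at least 90% of the variance
n_keep = int(np.argmax(cum_var >= 0.90) + 1)
print(n_keep)  # 4
```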
Let's plot the first two components and see how the data points are distributed.
# Scatter plot for two components using the dataframe data_pca
plt.figure(figsize = (7, 7))
sns.scatterplot(x = data_pca[0], y = data_pca[1])
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.show()
Let's apply clustering algorithms on the data generated after applying PCA
distortions = [] # Create an empty list
K = range(2, 10) # Candidate values of K from 2 to 9
for k in K:
    kmeanModel = KMeans(n_clusters=k, random_state=4) # Initialize K-Means
    kmeanModel.fit(data_pca) # Fit K-Means on the data
    distortions.append(kmeanModel.inertia_) # Append the distortion value to the list created above
# Plotting the elbow plot
plt.figure(figsize=(16, 8)) # Setting the plot size
plt.plot(K, distortions, "bx-") # Plotting the K on X-axis and distortions on y-axis
plt.xlabel("k") # Title of x-axis
plt.ylabel("Distortion") # Title of y-axis
plt.title("The Elbow Method showing the optimal k") # Title of the plot
plt.show()
In the above plot, the elbow is seen for K=3 and K=5 as there is some drop in distortion at K=3 and K=5.
Think About It:
We can use the silhouette score as a metric for different K values to make a better decision about picking the number of clusters(K).
Silhouette score is one of the methods for evaluating the quality of clusters created using clustering algorithms such as K-Means. The silhouette score is a measure of how similar an object is to its cluster (cohesion) compared to other clusters (separation). Silhouette score has a range of [-1, 1].
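As a quick sanity check on the metric itself, clearly separated synthetic clusters should score close to 1, while overlapping clusters drift toward 0. This sketch uses scikit-learn's make_blobs rather than the campaign data:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Three well-separated synthetic clusters
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.5, random_state=1)
labels = KMeans(n_clusters=3, random_state=1, n_init=10).fit_predict(X)

# Well-separated blobs yield a high silhouette score (close to 1)
print(round(silhouette_score(X, labels), 2))
```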
Finding silhouette score for each value of K
sil_score = [] # Creating an empty list
cluster_list = range(3, 7) # Candidate values of K from 3 to 6
for n_clusters in cluster_list:
    # Initialize K-Means with number of clusters equal to n_clusters and random_state=1
    clusterer = KMeans(n_clusters=n_clusters, random_state=1)
    # Fit and predict on the PCA data
    preds = clusterer.fit_predict(data_pca)
    # Calculate the silhouette score - Hint: Use the silhouette_score() function
    score = silhouette_score(data_pca, preds)
    # Append the silhouette score (not preds) to the list created above
    sil_score.append(score)
    # Print the silhouette score
    print("For n_clusters = {}, the silhouette score is {}".format(n_clusters, score))
For n_clusters = 3, the silhouette score is 0.2718010164963816 For n_clusters = 4, the silhouette score is 0.25228159218826157 For n_clusters = 5, the silhouette score is 0.23855028311006854 For n_clusters = 6, the silhouette score is 0.1219241977999894
From the above silhouette scores, 3 appears to be a good value of K. So, let's build K-Means using K=3.
kmeans = KMeans(n_clusters=3, random_state = 1) # Initialize the K-Means algorithm with 3 clusters and random_state=1
kmeans.fit(data_pca) # Fitting on the data_pca
KMeans(n_clusters=3, random_state=1)
data_pca["K_means_segments_3"] = kmeans.labels_ # Adding K-Means cluster labels to the data_pca data
data["K_means_segments_3"] = kmeans.labels_ # Adding K-Means cluster labels to the whole data
data_model["K_means_segments_3"] = kmeans.labels_ # Adding K-Means cluster labels to data_model
# Let's check the distribution
data_model["K_means_segments_3"].value_counts()
1 1059 0 607 2 561 Name: K_means_segments_3, dtype: int64
**Let's visualize the clusters using PCA**
# Function to visualize the PCA data with the clusters formed
def PCA_PLOT(X, Y, PCA, cluster):
    sns.scatterplot(x=X, y=Y, data=PCA, hue=cluster)
PCA_PLOT(0, 1, data_pca, "K_means_segments_3")
Observations and Insights:
Cluster sizes are 607 for cluster 0, 1059 for cluster 1, and 561 for cluster 2. The clusters are well separated with clear boundaries.
# Taking the cluster-wise mean of all the variables. Hint: First groupby 'data' by 'K_means_segments_3' and then find mean
cluster_profile_KMeans_3 = data.groupby("K_means_segments_3").mean()
# Highlighting the maximum average value among all the clusters for each of the variables
cluster_profile_KMeans_3.style.highlight_max(color="lightgreen", axis=0)
| Year_Birth | Income | Kidhome | Teenhome | Recency | MntWines | MntFruits | MntMeatProducts | MntFishProducts | MntSweetProducts | MntGoldProds | NumDealsPurchases | NumWebPurchases | NumCatalogPurchases | NumStorePurchases | NumWebVisitsMonth | AcceptedCmp3 | AcceptedCmp4 | AcceptedCmp5 | AcceptedCmp1 | AcceptedCmp2 | Complain | Response | Age | Kids | Status | Family_Size | Expenses | NumTotalPurchases | Engaged_in_days | TotalAcceptedCmp | AmountPerPurchase | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| K_means_segments_3 | ||||||||||||||||||||||||||||||||
| 0 | 1965.861614 | 57700.039539 | 0.278418 | 0.846787 | 47.731466 | 455.062603 | 23.329489 | 138.293245 | 30.607908 | 24.774300 | 62.331137 | 3.795717 | 6.487644 | 3.102142 | 7.906096 | 5.739703 | 0.070840 | 0.123558 | 0.023064 | 0.036244 | 0.016474 | 0.008237 | 0.133443 | 50.138386 | 1.125206 | 1.658979 | 2.784185 | 672.067545 | 21.291598 | 592.602965 | 0.403624 | 31.344661 |
| 1 | 1971.018886 | 35319.806421 | 0.755430 | 0.474032 | 49.210576 | 45.411709 | 5.047214 | 23.671388 | 7.395656 | 5.148253 | 15.574127 | 2.021719 | 2.146364 | 0.579792 | 3.288952 | 6.379603 | 0.068933 | 0.015109 | 0.000000 | 0.000944 | 0.001889 | 0.010387 | 0.084986 | 44.981114 | 1.229462 | 1.650614 | 2.880076 | 86.674221 | 8.036827 | 500.758263 | 0.171860 | 9.412071 |
| 2 | 1968.142602 | 75975.609626 | 0.037433 | 0.204991 | 50.436720 | 633.704100 | 69.916221 | 462.395722 | 102.483066 | 71.395722 | 78.427807 | 1.294118 | 5.219251 | 6.024955 | 8.331551 | 2.891266 | 0.083779 | 0.135472 | 0.263815 | 0.215686 | 0.030303 | 0.007130 | 0.290553 | 47.857398 | 0.242424 | 1.618538 | 1.860963 | 1339.894831 | 20.869875 | 550.069519 | 1.019608 | 69.554890 |
Observations and Insights:
Our clusters separate customers into High, Medium, and Low groups by Income, with similarly distinct divisions for Expenses and AmountPerPurchase.
Let us create a boxplot for each of the variables
# Columns to use in boxplot
col_for_box = ['Income','Kidhome','Teenhome','Recency','MntWines','MntFruits','MntMeatProducts','MntFishProducts','MntSweetProducts','MntGoldProds','NumDealsPurchases','NumWebPurchases','NumCatalogPurchases','NumStorePurchases','NumWebVisitsMonth','Complain','Age','Family_Size','Expenses','NumTotalPurchases','Engaged_in_days','TotalAcceptedCmp','AmountPerPurchase']
# Creating boxplot for each of the variables
all_col = col_for_box
plt.figure(figsize = (30, 50))
for i, variable in enumerate(all_col):
    plt.subplot(6, 4, i + 1)
    sns.boxplot(y=data[variable], x=data['K_means_segments_3'], showmeans=True)
    plt.tight_layout()
    plt.title(variable)
plt.show()
Cluster 0: Middle Income
Summary for cluster 0: This cluster has 607 individuals and represents the middle group. They have the second highest income, an average of 1 kid in the home with an average family size of 3, and sit in the middle for most purchase categories. However, they use more deals than any other group, shop on the web the most, represent the oldest group, make the highest number of purchases, and have been customers for the longest amount of time. They are the middle group for web visits, store purchases, and catalog purchases, but have the highest number of web purchases. They are second highest for accepting campaigns and for amount per purchase, and they are lowest only for recency. The average age for this cluster is 50 years.
This cluster does the most shopping online and takes advantage of more deals than any other group. They also shop the most frequently and have been customers the longest. Their income and family size suggest they are closest to the middle class.
Cluster 1: Low Income
Summary for cluster 1: This cluster has 1059 individuals. It is the group with the lowest income, the youngest members, and the most kids in the home. They have spent the least on all food categories by a considerable margin and make the fewest purchases. They have the lowest expenses and have not accepted many campaigns, but they rank second among those who purchase using deals, and they visit the website more than the other two clusters.
This group does not make many purchases or shop much with the company. They do a fair amount of window shopping, so to speak, by browsing the website, but make few actual purchases; despite spending the most time on the website, they are still lowest in website purchases. This group is watching how they spend their money and likely buying their groceries elsewhere, and they have been customers for the shortest period of time.
Cluster 2: High Income
Summary for cluster 2: This cluster has 561 individuals and represents the customers with the highest income. They spend the most in all product sub-categories, use the fewest deals, and visit the website the least. They are highest for catalog and in-store purchases as well as for accepting every campaign. They have the second highest number of purchases, but their amount per purchase, like their expenses, is vastly greater than any other group's.
The high income group is the most engaged cluster. They have responded to the highest number of campaigns and spend the most money with the company. They also use the catalog more than any other group, preferring catalog and in-store purchases over visiting the website at all.
Think About It:
We can see from the above profiles that K=3 segments the customers into High, Medium and Low-income customers, and we are not getting deep insights into different types of customers. So, let's try to build K=5 (which has another elbow in the Elbow curve) and see if we can get better cluster profiles.
# Dropping the labels we got from K=3 since we will be using the PCA data for prediction
# Drop K_means_segments_3 from each dataframe it was added to. Hint: Use axis=1 and inplace=True
data_pca.drop("K_means_segments_3", axis=1, inplace=True)
data.drop("K_means_segments_3", axis=1, inplace=True)
data_model.drop("K_means_segments_3", axis=1, inplace=True)
Let's build K-Means using K=5
# Fit the K-Means algorithm using number of clusters as 5 and random_state=1 on data_pca
kmeans = KMeans(n_clusters=5, random_state=1)
kmeans.fit(data_pca)
KMeans(n_clusters=5, random_state=1)
# Add K-Means cluster labels to data_pca
data_pca["K_means_segments_5"] = kmeans.labels_
# Add K-Means cluster labels to the whole data
data["K_means_segments_5"] = kmeans.labels_
# Add K-Means cluster labels to data_model
data_model["K_means_segments_5"] = kmeans.labels_
# Let's check the distribution
data_model["K_means_segments_5"].value_counts()
2 994 0 403 1 357 4 308 3 165 Name: K_means_segments_5, dtype: int64
Let's visualize the clusters using PCA
# Hint: Use the PCA_PLOT function created above
PCA_PLOT(0, 1, data_pca, "K_means_segments_5")
# Take the cluster-wise mean of all the variables. Hint: First groupby 'data' by cluster labels column and then find mean
cluster_profile_KMeans_5 = data.groupby("K_means_segments_5").mean()
# Highlight the maximum average value among all the clusters for each of the variables
cluster_profile_KMeans_5.style.highlight_max(color="lightgreen", axis=0)
| Year_Birth | Income | Kidhome | Teenhome | Recency | MntWines | MntFruits | MntMeatProducts | MntFishProducts | MntSweetProducts | MntGoldProds | NumDealsPurchases | NumWebPurchases | NumCatalogPurchases | NumStorePurchases | NumWebVisitsMonth | AcceptedCmp3 | AcceptedCmp4 | AcceptedCmp5 | AcceptedCmp1 | AcceptedCmp2 | Complain | Response | Age | Kids | Status | Family_Size | Expenses | NumTotalPurchases | Engaged_in_days | TotalAcceptedCmp | AmountPerPurchase | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| K_means_segments_5 | ||||||||||||||||||||||||||||||||
| 0 | 1965.585608 | 63397.915633 | 0.106700 | 0.751861 | 48.406948 | 509.126551 | 33.553350 | 182.409429 | 40.461538 | 32.322581 | 62.414392 | 2.404467 | 6.347395 | 3.811414 | 8.970223 | 4.565757 | 0.044665 | 0.099256 | 0.024814 | 0.029777 | 0.007444 | 0.014888 | 0.054591 | 50.414392 | 0.858561 | 1.645161 | 2.503722 | 797.873449 | 21.533499 | 544.935484 | 0.260546 | 37.399934 |
| 1 | 1967.568627 | 74565.285714 | 0.025210 | 0.204482 | 51.050420 | 508.112045 | 80.560224 | 467.282913 | 120.535014 | 80.686275 | 86.733894 | 1.366947 | 4.893557 | 6.014006 | 8.378151 | 2.579832 | 0.030812 | 0.028011 | 0.103641 | 0.109244 | 0.000000 | 0.000000 | 0.140056 | 48.431373 | 0.229692 | 1.627451 | 1.857143 | 1257.176471 | 20.652661 | 531.176471 | 0.411765 | 63.251772 |
| 2 | 1971.243461 | 34929.668008 | 0.754527 | 0.461771 | 49.263581 | 39.363179 | 4.852113 | 20.918511 | 6.972837 | 5.027163 | 14.565392 | 1.850101 | 2.010060 | 0.523139 | 3.230382 | 6.305835 | 0.063380 | 0.014085 | 0.000000 | 0.001006 | 0.002012 | 0.011066 | 0.078471 | 44.756539 | 1.216298 | 1.651911 | 2.868209 | 77.133803 | 7.613682 | 491.091549 | 0.158954 | 8.966401 |
| 3 | 1969.896970 | 79500.800000 | 0.048485 | 0.187879 | 48.557576 | 945.696970 | 49.745455 | 476.921212 | 72.006061 | 56.648485 | 66.472727 | 1.206061 | 5.921212 | 6.200000 | 8.030303 | 3.690909 | 0.236364 | 0.412121 | 0.690909 | 0.496970 | 0.133333 | 0.006061 | 0.715152 | 46.103030 | 0.236364 | 1.600000 | 1.836364 | 1601.018182 | 21.357576 | 596.436364 | 2.684848 | 87.930039 |
| 4 | 1966.600649 | 48840.996753 | 0.584416 | 0.860390 | 47.626623 | 318.444805 | 11.090909 | 92.857143 | 18.685065 | 14.181818 | 54.425325 | 5.441558 | 6.035714 | 2.113636 | 5.889610 | 7.220779 | 0.103896 | 0.113636 | 0.003247 | 0.032468 | 0.006494 | 0.006494 | 0.214286 | 49.399351 | 1.444805 | 1.665584 | 3.110390 | 455.259740 | 19.480519 | 658.461039 | 0.474026 | 22.531721 |
Let's plot the boxplot
# Create boxplot for each of the variables
all_col = col_for_box
plt.figure(figsize = (30, 50))
for i, variable in enumerate(all_col):
    plt.subplot(6, 4, i + 1)
    sns.boxplot(y=data[variable], x=data['K_means_segments_5'], showmeans=True)
    plt.tight_layout()
    plt.title(variable)
plt.show()
Cluster 0: Middle Income with Kids and Oldest
Summary for cluster 0: This group is made up of 403 of the oldest customers, with the middle-most average income. They have more teens and fewer small children in their homes. They are the middle group for all food purchases and have the second lowest recency, having shopped more recently than three of the other groups. They have the highest numbers of in-store purchases, web purchases, and total purchases, and also the highest number of complaints. Despite making the most purchases, they are the middle group for expenses and amount per purchase. They responded worst to campaign 2 and best to campaign 4.
Cluster 1: High Income, No Kids, Older
Summary for cluster 1: This group is made up of 357 of the second highest income customers, with the fewest kids in their homes. They have the highest recency and spend the most on fruit, fish, sweets, and gold products. They are second highest for catalog and in-store purchases and have the second highest expenses and purchases. They have the lowest number of website visits, and they responded best to campaigns 1 and 5 and worst to campaign 2.
Cluster 2: Low Income with Kids and Youngest
Summary for cluster 2: This group is made up of 994 customers with the lowest income, the highest number of kids in the house, and the youngest ages. They have the second-highest recency and the lowest purchase amounts across all the food categories. They are the middle group for deals purchases and the lowest for store, web, and catalog purchases. They are second highest for website visits per month, which, coupled with their low income and higher deals purchases, indicates that they mostly shop when they have a deal and regularly check for deals. They are the newest customers, with the lowest expenses and purchases. They were most receptive to campaign 3 and least receptive to campaign 5. This group would be a good target for email marketing: they check the website regularly looking for deals, and personalized ads would get them in the door more often.
Cluster 3: High Income with no Kids and Young
Summary for cluster 3: This group is made up of 165 customers with the highest income and no children, and they have spent the most on wine and meat of all the groups. They have the highest number of catalog purchases and are the middle group for web and store purchases. This group has accepted more campaigns than any other. They have the second-highest expenses and the second-highest total number of purchases. They spend the most per purchase and rank second for how long they have been customers.
Cluster 4: Low Income with Kids and Older
Summary for cluster 4: This group is made up of 308 older customers with low income and kids. They are second lowest in all purchases but rank highest for purchases made with deals. They are second highest for website purchases and second lowest for catalog and in-store purchases. They have the highest number of website visits per month and the largest family size. They responded best to campaign 4 and have been customers the longest of all groups.
# Dropping labels we got from K-Means since we will be using PCA data for prediction
# Hint: Use axis=1 and inplace=True
data_pca.drop("K_means_segments_5", axis=1, inplace=True)
data.drop("K_means_segments_5", axis=1, inplace=True)
From the above profiles, K=5 provides more interesting insights into customers' purchasing behavior and preferred purchase channels. We can also see that the high-, medium-, and low-income groups have different age profiles and preferences, which was not evident with K=3. So we can choose K=5.
Let's find the silhouette score for K=5 in K-Medoids
kmedo = KMedoids(n_clusters = 5, random_state = 1) # Initializing K-Medoids with number of clusters as 5 and random_state=1
preds = kmedo.fit_predict(data_pca) # Fit and predict K-Medoids using data_pca
score = silhouette_score(data_pca, preds) # Calculate the silhouette score
print(score) # Print the score
0.10644004638344105
Observations and Insights:
Our silhouette score for K-Medoids is 0.10644004638344105 which is lower than it was for K-means with k=5.
# Predicting on data_pca and adding K-Medoids cluster labels to the whole data
data['K_medoids_segments_5'] = kmedo.fit_predict(data_pca)
# Predicting on data_pca and adding K-Medoids cluster labels to data_model
data_model['K_medoids_segments_5'] = kmedo.fit_predict(data_pca)
# Predicting on data_pca and adding K-Medoids cluster labels to data_pca
data_pca['K_medoids_segments_5'] = kmedo.fit_predict(data_pca)
# Let's check the distribution
data_model["K_medoids_segments_5"].value_counts()
1 630 2 580 0 477 3 278 4 262 Name: K_medoids_segments_5, dtype: int64
Let's visualize the clusters using PCA
# Hint: Use PCA_PLOT function created above
def PCA_PLOT(X, Y, PCA, cluster):
    # Scatter plot of two PCA components (columns X and Y), colored by cluster label
    sns.scatterplot(x=X, y=Y, data=PCA, hue=cluster)
PCA_PLOT(0, 1, data_pca, "K_medoids_segments_5")
# Take the cluster-wise mean of all the variables. Hint: First group 'data' by cluster labels column and then find mean
cluster_profile_KMedoids_5 = data.groupby("K_medoids_segments_5").mean()
# Highlight the maximum average value among all the clusters for each of the variables
cluster_profile_KMedoids_5.style.highlight_max(color="lightgreen", axis=0)
| Year_Birth | Income | Kidhome | Teenhome | Recency | MntWines | MntFruits | MntMeatProducts | MntFishProducts | MntSweetProducts | MntGoldProds | NumDealsPurchases | NumWebPurchases | NumCatalogPurchases | NumStorePurchases | NumWebVisitsMonth | AcceptedCmp3 | AcceptedCmp4 | AcceptedCmp5 | AcceptedCmp1 | AcceptedCmp2 | Complain | Response | Age | Kids | Status | Family_Size | Expenses | NumTotalPurchases | Engaged_in_days | TotalAcceptedCmp | AmountPerPurchase | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| K_medoids_segments_5 | ||||||||||||||||||||||||||||||||
| 0 | 1968.664570 | 76368.226415 | 0.044025 | 0.220126 | 50.884696 | 658.668763 | 75.383648 | 481.109015 | 109.316562 | 79.784067 | 82.222222 | 1.385744 | 5.488470 | 5.920335 | 8.576520 | 3.075472 | 0.090147 | 0.132075 | 0.285115 | 0.232704 | 0.029350 | 0.004193 | 0.322851 | 47.335430 | 0.264151 | 1.610063 | 1.874214 | 1404.262055 | 21.371069 | 566.538784 | 1.092243 | 71.858839 |
| 1 | 1972.455556 | 32820.985714 | 0.800000 | 0.409524 | 39.488889 | 23.952381 | 3.442857 | 13.973016 | 5.104762 | 3.395238 | 10.428571 | 1.560317 | 1.644444 | 0.312698 | 2.885714 | 6.553968 | 0.085714 | 0.004762 | 0.000000 | 0.000000 | 0.003175 | 0.011111 | 0.084127 | 43.544444 | 1.209524 | 1.647619 | 2.857143 | 49.868254 | 6.403175 | 475.193651 | 0.177778 | 7.093609 |
| 2 | 1965.594828 | 61851.193103 | 0.172414 | 0.734483 | 46.206897 | 499.455172 | 27.700000 | 180.781034 | 37.748276 | 27.312069 | 63.991379 | 3.101724 | 6.156897 | 3.893103 | 8.237931 | 4.874138 | 0.068966 | 0.132759 | 0.043103 | 0.044828 | 0.020690 | 0.010345 | 0.112069 | 50.405172 | 0.906897 | 1.660345 | 2.567241 | 772.996552 | 21.389655 | 565.806897 | 0.422414 | 37.009533 |
| 3 | 1969.600719 | 38431.715827 | 0.784173 | 0.654676 | 45.960432 | 134.028777 | 8.032374 | 55.309353 | 13.003597 | 7.888489 | 37.133094 | 4.276978 | 4.262590 | 1.194245 | 4.197842 | 7.733813 | 0.068345 | 0.046763 | 0.000000 | 0.017986 | 0.003597 | 0.014388 | 0.223022 | 46.399281 | 1.438849 | 1.597122 | 3.035971 | 218.262590 | 13.931655 | 710.190647 | 0.359712 | 15.142116 |
| 4 | 1967.255725 | 43463.137405 | 0.561069 | 0.610687 | 78.835878 | 90.087786 | 8.790076 | 37.767176 | 11.583969 | 8.828244 | 19.458015 | 2.057252 | 2.782443 | 1.015267 | 4.206107 | 4.919847 | 0.026718 | 0.041985 | 0.003817 | 0.007634 | 0.000000 | 0.003817 | 0.000000 | 48.744275 | 1.171756 | 1.717557 | 2.889313 | 157.057252 | 10.061069 | 394.618321 | 0.080153 | 13.714535 |
Let's plot the boxplot
# Create a boxplot for each of the variables, split by cluster label
all_col = col_for_box
plt.figure(figsize=(30, 50))
for i, variable in enumerate(all_col):
    plt.subplot(6, 4, i + 1)
    sns.boxplot(y=data[variable], x=data['K_medoids_segments_5'], showmeans=True)
    plt.title(variable)
plt.tight_layout()
plt.show()
Cluster 0: High Income no Kids/Teens
Summary for cluster 0: This group is made up of 477 customers who have high recency, the highest purchase amounts for all food categories, and the lowest use of deals. They have the lowest number of website visits and the highest store and catalog purchases. They have the highest expenses, total number of purchases, and amount per purchase, and have accepted the most campaigns.
Cluster 1: Low Income with Kids and Youngest
Summary for cluster 1: This group is made up of 630 customers who have the lowest purchase amounts for all categories but also the lowest recency. They are second highest for number of web visits in a month, and second lowest both in accepting campaigns and in how long they have been customers. They have the lowest expenses and number of purchases.
Cluster 2: High/Middle Income with Kids/Teens and Oldest
Summary for cluster 2: This group is made up of 580 customers who have higher income and kids in the home. They are second highest for purchases in all food categories and for deal usage. They are highest for web purchases and second highest for store and catalog purchases. They are second lowest for web visits per month and were highest for accepting campaign 4. They make up the oldest group, with the highest number of purchases and amount per purchase.
Cluster 3: Low/Middle Income with Kids/Teens Second Youngest
Summary for cluster 3: This group is made up of 278 customers who have low income and kids in the home. They are the middle group for food purchases and the highest users of deals. They are the middle group for web and catalog purchases, second lowest for store purchases, and highest for web visits per month. They are highest for complaints and second highest for response. They have the largest family size and are the middle group for expenses, number of total purchases, and amount per purchase. They have been customers the longest.
Cluster 4: Middle Income with Kids/Teens Second Oldest
Summary for cluster 4: This group is made up of 262 customers who have the highest recency. They are either second lowest or the middle group across all food purchase categories. They are the middle group for deals purchases, catalog purchases, store purchases, and web visits, and second lowest for web purchases. The campaign they responded to best was campaign 4, but they have the lowest response overall. They are the second-oldest group, with the highest status and second-largest family size. They have the second-lowest expenses, total purchases, and amount per purchase. They are lowest for total accepted campaigns and have been customers for the least amount of time.
# Dropping labels we got from K-Medoids since we will be using PCA data for prediction
# Hint: Use axis=1 and inplace=True
data_pca.drop("K_medoids_segments_5", axis=1, inplace=True)
data.drop("K_medoids_segments_5", axis=1, inplace=True)
Let's find the Cophenetic correlation for different distances with different linkage methods.
The cophenetic correlation coefficient is the correlation between the cophenetic distances (dendrogram distances) obtained from the tree and the original pairwise distances used to construct the tree. It measures how faithfully a dendrogram preserves the pairwise distances between the original, unmodeled data points.
The cophenetic distance between two observations is represented in a dendrogram by the height of the link at which those two observations are first joined. That height is the distance between the two subclusters that are merged by that link.
Cophenetic correlation is the way to compare two or more dendrograms.
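The idea above can be illustrated with a small, self-contained sketch on synthetic data (a stand-in for `data_pca`, not the project's data): build a linkage matrix, then correlate the cophenetic distances against the original pairwise distances with SciPy's `cophenet`.

```python
# Minimal sketch on synthetic data: a cophenetic correlation close to 1 means
# the dendrogram preserves the original pairwise distances well.
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import pdist

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 3))  # toy data standing in for data_pca

Z = linkage(X, metric="euclidean", method="average")  # build the hierarchy
c, coph_dists = cophenet(Z, pdist(X))  # c = cophenetic correlation coefficient
print(round(c, 3))
```

`coph_dists` is a condensed distance vector with the same length as `pdist(X)`, so the two can be compared entry by entry.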
Let's calculate Cophenetic correlation for each of the distance metrics with each of the linkage methods
# List of distance metrics
distance_metrics = ["euclidean", "chebyshev", "mahalanobis", "cityblock"]
# List of linkage methods
linkage_methods = ["single", "complete", "average"]

high_cophenet_corr = 0  # Tracks the highest cophenetic correlation seen so far
high_dm_lm = [0, 0]     # Stores the corresponding [distance metric, linkage method]

for dm in distance_metrics:
    for lm in linkage_methods:
        Z = linkage(data_pca, metric=dm, method=lm)   # Build the linkage matrix on data_pca
        c, coph_dists = cophenet(Z, pdist(data_pca))  # Calculate the cophenetic correlation
        print(
            "Cophenetic correlation for {} distance and {} linkage is {}.".format(
                dm.capitalize(), lm, c
            )
        )
        if high_cophenet_corr < c:  # Keep the best combination seen so far
            high_cophenet_corr = c
            high_dm_lm[0] = dm      # Its distance metric
            high_dm_lm[1] = lm      # Its linkage method
Cophenetic correlation for Euclidean distance and single linkage is 0.8070498805503135. Cophenetic correlation for Euclidean distance and complete linkage is 0.5915211029724521. Cophenetic correlation for Euclidean distance and average linkage is 0.8563332929010883. Cophenetic correlation for Chebyshev distance and single linkage is 0.7923751605955872. Cophenetic correlation for Chebyshev distance and complete linkage is 0.68529576496396. Cophenetic correlation for Chebyshev distance and average linkage is 0.8041971696973682. Cophenetic correlation for Mahalanobis distance and single linkage is 0.6226039066330016. Cophenetic correlation for Mahalanobis distance and complete linkage is 0.38496663858473046. Cophenetic correlation for Mahalanobis distance and average linkage is 0.6546345527169063. Cophenetic correlation for Cityblock distance and single linkage is 0.8051878506561675. Cophenetic correlation for Cityblock distance and complete linkage is 0.816242660463673. Cophenetic correlation for Cityblock distance and average linkage is 0.8566433845473891.
# Printing the combination of distance metric and linkage method with the highest cophenetic correlation
print(
"Highest cophenetic correlation is {}, which is obtained with {} distance and {} linkage.".format(
high_cophenet_corr, high_dm_lm[0].capitalize(), high_dm_lm[1]
)
)
Highest cophenetic correlation is 0.8566433845473891, which is obtained with Cityblock distance and average linkage.
# Initialize Agglomerative Clustering with Cityblock distance and average linkage, with clusters=3
HCmodel = AgglomerativeClustering(n_clusters=3, affinity="cityblock", linkage="average")
# Fit on data_pca
HCmodel.fit(data_pca)
labels = HCmodel.labels_
score = silhouette_score(data_pca, labels) # Calculate the silhouette score
print(score)
0.6409888705869505
# Add Agglomerative Clustering cluster labels to data_pca
data_pca["HClabels"]= HCmodel.labels_
# Add Agglomerative Clustering cluster labels to the whole data
data["HClabels"]= HCmodel.labels_
# Add Agglomerative Clustering cluster labels to data_model
data_model["HClabels"]= HCmodel.labels_
data_model["HClabels"].value_counts()
0 2225 2 1 1 1 Name: HClabels, dtype: int64
Let's have a look at the dendrograms for different linkages with Cityblock distance
# List of linkage methods
linkage_methods = ["single", "complete", "average"]

# Create a subplot image, one row per linkage method
fig, axs = plt.subplots(len(linkage_methods), 1, figsize=(15, 30))

# For each linkage method, plot the dendrogram and calculate the cophenetic correlation
for i, method in enumerate(linkage_methods):
    Z = linkage(data_pca, metric="cityblock", method=method)  # Note: scipy expects lowercase metric names
    dendrogram(Z, ax=axs[i])
    axs[i].set_title(f"Dendrogram ({method.capitalize()} Linkage)")  # Title of dendrogram
    coph_corr, coph_dist = cophenet(Z, pdist(data_pca))  # Cophenetic correlation for Cityblock distance with this linkage
    axs[i].annotate(
        f"Cophenetic\nCorrelation\n{coph_corr:0.2f}",
        (0.80, 0.80),
        xycoords="axes fraction",
    )
Observations and Insights:
We can make out two clear links in single linkage. Complete linkage looks promising in that the links are visible, but because we care most about the links at the top having the greatest height, average linkage is the best choice: it shows the largest distances between the subclusters merged by the top links.
Cophenetic correlations: single = 0.81, complete = 0.82, average = 0.86.
Think about it:
Let's have a look at the dendrograms for different linkages with Chebyshev distance
# Plot the dendrogram for Chebyshev distance with linkages single, complete and average.
# Hint: Use Chebyshev distance as the metric in the linkage() function
# List of linkage methods
linkage_methods = ["single", "complete", "average"]

# Create a subplot image, one row per linkage method
fig, axs = plt.subplots(len(linkage_methods), 1, figsize=(15, 30))

# For each linkage method, plot the dendrogram and calculate the cophenetic correlation
for i, method in enumerate(linkage_methods):
    Z = linkage(data_pca, metric="chebyshev", method=method)  # Note: scipy expects lowercase metric names
    dendrogram(Z, ax=axs[i])
    axs[i].set_title(f"Dendrogram ({method.capitalize()} Linkage)")  # Title of dendrogram
    coph_corr, coph_dist = cophenet(Z, pdist(data_pca))  # Cophenetic correlation for Chebyshev distance with this linkage
    axs[i].annotate(
        f"Cophenetic\nCorrelation\n{coph_corr:0.2f}",
        (0.80, 0.80),
        xycoords="axes fraction",
    )
Observations and Insights: Single linkage gives two clear links before losing height. Complete linkage shows the most clearly defined links, but again the height drops drastically after two. Average looks good for the first two links but quickly loses height after that.
Let's have a look at the dendrograms for different linkages with Mahalanobis distance
# Plot the dendrogram for Mahalanobis distance with linkages single, complete and average.
# Hint: Use Mahalanobis distance as the metric in the linkage() function
# List of linkage methods
linkage_methods = ["single", "complete", "average"]

# Create a subplot image, one row per linkage method
fig, axs = plt.subplots(len(linkage_methods), 1, figsize=(15, 30))

# For each linkage method, plot the dendrogram and calculate the cophenetic correlation
for i, method in enumerate(linkage_methods):
    Z = linkage(data_pca, metric="mahalanobis", method=method)  # Note: scipy expects lowercase metric names
    dendrogram(Z, ax=axs[i])
    axs[i].set_title(f"Dendrogram ({method.capitalize()} Linkage)")  # Title of dendrogram
    coph_corr, coph_dist = cophenet(Z, pdist(data_pca))  # Cophenetic correlation for Mahalanobis distance with this linkage
    axs[i].annotate(
        f"Cophenetic\nCorrelation\n{coph_corr:0.2f}",
        (0.80, 0.80),
        xycoords="axes fraction",
    )
Observations and Insights:
Mahalanobis distance gives very good link heights with average linkage.
Cophenetic correlations: single = 0.62, complete = 0.38, average = 0.65.
Let's have a look at the dendrograms for different linkages with Euclidean distance
# Plot the dendrogram for Euclidean distance with linkages single, complete, average and ward.
# Hint: Use Euclidean distance as the metric in the linkage() function
# List of linkage methods
linkage_methods = ["single", "complete", "average"]

# Create a subplot image, one row per linkage method
fig, axs = plt.subplots(len(linkage_methods), 1, figsize=(15, 30))

# For each linkage method, plot the dendrogram and calculate the cophenetic correlation
for i, method in enumerate(linkage_methods):
    Z = linkage(data_pca, metric="euclidean", method=method)  # Note: scipy expects lowercase metric names
    dendrogram(Z, ax=axs[i])
    axs[i].set_title(f"Dendrogram ({method.capitalize()} Linkage)")  # Title of dendrogram
    coph_corr, coph_dist = cophenet(Z, pdist(data_pca))  # Cophenetic correlation for Euclidean distance with this linkage
    axs[i].annotate(
        f"Cophenetic\nCorrelation\n{coph_corr:0.2f}",
        (0.80, 0.80),
        xycoords="axes fraction",
    )
Think about it:
Observations and Insights:
Single linkage has cophenetic correlation of 0.81
Complete linkage has cophenetic correlation of 0.59
Euclidean average has good height for the first 3 clusters with cophenetic correlation of 0.86
# Initialize Agglomerative Clustering with affinity (distance) as Euclidean, linkage as 'Ward' with clusters=3
HCmodel = AgglomerativeClustering(n_clusters=3, affinity="euclidean", linkage="ward")
# Fit on data_pca
HCmodel.fit(data_pca)
labels = HCmodel.labels_
score = silhouette_score(data_pca, labels) # Calculate the silhouette score
print(score)
0.2543856181377133
# Add Agglomerative Clustering cluster labels to data_pca
data_pca["HClabels"]= HCmodel.labels_
# Add Agglomerative Clustering cluster labels to the whole data
data["HClabels"]= HCmodel.labels_
# Add Agglomerative Clustering cluster labels to data_model
data_model["HClabels"]= HCmodel.labels_
# Let's check the distribution
data_model["HClabels"].value_counts()
1 1010 2 675 0 542 Name: HClabels, dtype: int64
Let's visualize the clusters using PCA.
# Hint: Use PCA_PLOT function created above
def PCA_PLOT(X, Y, PCA, cluster):
    # Scatter plot of two PCA components (columns X and Y), colored by cluster label
    sns.scatterplot(x=X, y=Y, data=PCA, hue=cluster)
PCA_PLOT(0, 1, data_pca, "HClabels")
# Take the cluster-wise mean of all the variables. Hint: First group 'data' by cluster labels column and then find mean
cluster_profile_HClabels = data.groupby("HClabels").mean()
# Highlight the maximum average value among all the clusters for each of the variables
cluster_profile_HClabels.style.highlight_max(color="lightgreen", axis=0)
| Year_Birth | Income | Kidhome | Teenhome | Recency | MntWines | MntFruits | MntMeatProducts | MntFishProducts | MntSweetProducts | MntGoldProds | NumDealsPurchases | NumWebPurchases | NumCatalogPurchases | NumStorePurchases | NumWebVisitsMonth | AcceptedCmp3 | AcceptedCmp4 | AcceptedCmp5 | AcceptedCmp1 | AcceptedCmp2 | Complain | Response | Age | Kids | Status | Family_Size | Expenses | NumTotalPurchases | Engaged_in_days | TotalAcceptedCmp | AmountPerPurchase | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| HClabels | ||||||||||||||||||||||||||||||||
| 0 | 1968.732472 | 75132.143911 | 0.055351 | 0.241697 | 48.254613 | 607.841328 | 72.605166 | 449.559041 | 104.774908 | 73.365314 | 73.188192 | 1.418819 | 5.370849 | 5.618081 | 8.612546 | 2.998155 | 0.073801 | 0.125461 | 0.249077 | 0.195572 | 0.033210 | 0.005535 | 0.278598 | 47.267528 | 0.297048 | 1.608856 | 1.905904 | 1308.145756 | 21.020295 | 543.605166 | 0.955720 | 67.627434 |
| 1 | 1971.227723 | 34931.661386 | 0.768317 | 0.470297 | 49.204950 | 41.816832 | 4.583168 | 21.501980 | 6.456436 | 4.725743 | 15.413861 | 1.923762 | 2.087129 | 0.536634 | 3.202970 | 6.377228 | 0.072277 | 0.012871 | 0.000000 | 0.000000 | 0.001980 | 0.010891 | 0.087129 | 44.772277 | 1.238614 | 1.646535 | 2.885149 | 79.084158 | 7.750495 | 502.357426 | 0.174257 | 8.990624 |
| 2 | 1965.514074 | 57847.875556 | 0.272593 | 0.777778 | 49.675556 | 456.499259 | 21.848889 | 152.648889 | 30.511111 | 23.712593 | 63.837037 | 3.642963 | 6.103704 | 3.392593 | 7.485926 | 5.623704 | 0.074074 | 0.127407 | 0.040000 | 0.056296 | 0.013333 | 0.008889 | 0.140741 | 50.485926 | 1.050370 | 1.671111 | 2.721481 | 685.220741 | 20.625185 | 587.536296 | 0.451852 | 33.006349 |
Let's plot the boxplot
# Create a boxplot for each of the variables, split by cluster label
all_col = col_for_box
plt.figure(figsize=(30, 50))
for i, variable in enumerate(all_col):
    plt.subplot(6, 4, i + 1)
    sns.boxplot(y=data[variable], x=data['HClabels'], showmeans=True)
    plt.title(variable)
plt.tight_layout()
plt.show()
Cluster 0: High Income no Kids
Summary for cluster 0: This cluster is made up of 542 customers with the highest income and no kids or teens at home. They have the lowest recency, the highest purchases across all the food categories, and the lowest use of deals. They are the middle group for web purchases and the highest for catalog and store purchases. They have the fewest web visits per month and the best campaign acceptance overall, leading on campaigns 1, 2, and 5, with campaign 5 being the one they responded to best. They have the highest expenses, total number of purchases, and amount per purchase, and are the middle group for age.
Cluster 1: Low Income with Kids
Summary for cluster 1: This group is made up of 1010 of the youngest customers, with low income and the most kids at home. They have the lowest purchases across all the food categories by a considerable margin and are the lowest for web, catalog, and store purchases. They are the middle group for deals purchases and the highest for web visits per month. Of all the campaigns, they responded best to campaign 3, and they have the highest number of complaints. They have the largest family size, the lowest expenses, total purchases, total accepted campaigns, and amount per purchase, and have been customers for the shortest amount of time.
Cluster 2: Middle Income with Kids
Summary for cluster 2: This group is made up of 675 customers with middle income and kids in the home. They have the highest recency, are the middle group for all food products, and are the highest for deals purchases and web purchases. They are the middle group for catalog and store purchases and for web visits per month, and the highest group for accepting campaigns 3 and 4, with campaign 4 being the one they responded to best. They are the oldest group and the middle group for expenses, number of total purchases, total accepted campaigns, and amount per purchase. They have been customers the longest of all the groups.
Observations and Insights:
Ward linkage is known for producing clusters of more equal size, but here one cluster still has almost twice as many members as another.
# Dropping labels we got from Agglomerative Clustering since we will be using PCA data for prediction
# Hint: Use axis=1 and inplace=True
data_pca.drop("HClabels", axis=1, inplace=True)
data.drop("HClabels", axis=1, inplace=True)
DBSCAN is a very powerful algorithm for finding high-density clusters, but the challenge is determining the best set of hyperparameters to use with it. It has two hyperparameters: eps and min_samples.
Since it is an unsupervised algorithm, you cannot validate it the way a supervised learning algorithm is tested on a validation set. The approach we can follow is to try a number of different combinations of values and compute the silhouette score for each of them.
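The grid-search idea described above can be sketched compactly with `itertools.product`. This is a hedged illustration on synthetic blobs (not the project's `data_pca`), and the eps / min_samples grids are placeholders, not tuned values:

```python
# Sketch: try every (eps, min_samples) combination and keep the one with the
# best silhouette score. Uses synthetic blob data as a stand-in.
from itertools import product

from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=1)

best_score, best_params = -1.0, None
for eps, min_samples in product([0.5, 1.0, 2.0], [5, 10, 20]):
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
    if len(set(labels)) > 1:  # silhouette_score needs at least 2 distinct labels
        score = silhouette_score(X, labels)
        if score > best_score:
            best_score, best_params = score, (eps, min_samples)

print(best_score, best_params)
```

The `len(set(labels)) > 1` guard matters: DBSCAN can label everything as one cluster (or all noise), and `silhouette_score` raises an error in that case.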
# Initializing candidate hyperparameter values
eps_value = [2, 3]            # Candidate eps values
min_sample_values = [6, 20]   # Candidate min_samples values

# Creating a dictionary mapping each eps value to the list of min_samples values
res = {eps_value[i]: min_sample_values for i in range(len(eps_value))}

# Finding the silhouette score for each of the combinations
high_silhouette_avg = 0  # Tracks the highest silhouette score seen so far
high_i_j = [0, 0]        # Stores the corresponding [eps, min_samples]

for i in res.keys():
    for j in res[i]:
        db = DBSCAN(eps=i, min_samples=j).fit(data_pca)  # Apply DBSCAN for this combination
        labels = db.labels_
        silhouette_avg = silhouette_score(data_pca, labels)  # Find the silhouette score
        print(
            "For eps value =" + str(i),
            "For min sample =" + str(j),
            "The average silhouette_score is :",
            silhouette_avg,
        )
        if high_silhouette_avg < silhouette_avg:  # Keep the best combination seen so far
            high_silhouette_avg = silhouette_avg  # Bug fix: update the best score, not just the combination
            high_i_j[0] = i
            high_i_j[1] = j
For eps value =2 For min sample =6 The average silhoutte_score is : 0.1998438349972484 For eps value =2 For min sample =20 The average silhoutte_score is : 0.33805442426361443 For eps value =3 For min sample =6 The average silhoutte_score is : 0.3330837442382384 For eps value =3 For min sample =20 The average silhoutte_score is : 0.3390186745863706
# Printing the highest silhouette score
print("Highest silhouette score is {} for eps = {} and min sample = {}".format(high_silhouette_avg, high_i_j[0], high_i_j[1]))
Highest_silhoutte_avg is 0 for eps = 3 and min sample = 20
Now, let's apply DBSCAN using the hyperparameter values we have received above.
# Apply DBSCAN using the above hyperparameter values
dbs = DBSCAN(eps = 3, min_samples = 20)
# fit_predict on data_pca and add DBSCAN cluster labels to the whole data
data["DBSCAN"]= dbs.fit_predict(data_pca)
# fit_predict on data_pca and add DBSCAN cluster labels to data_model
data_model["DBSCAN"]= dbs.fit_predict(data_pca)
# fit_predict on data_pca and add DBSCAN cluster labels to data_pca
data_pca["DBSCAN"]= dbs.fit_predict(data_pca)
# Let's check the distribution
data_model["DBSCAN"].value_counts()
0 1902 -1 325 Name: DBSCAN, dtype: int64
Let's visualize the clusters using PCA.
# Hint: Use PCA_PLOT function created above
def PCA_PLOT(X, Y, PCA, cluster):
    # Scatter plot of two PCA components (columns X and Y), colored by cluster label
    sns.scatterplot(x=X, y=Y, data=PCA, hue=cluster)
PCA_PLOT(0, 1, data_pca, "DBSCAN")
Observations and Insights:
While the silhouette score is decent, the clusters are not what we are looking for. DBSCAN effectively found a single cluster of 1902 points, with the remaining 325 points labeled as noise (-1), which gives no insight into customer segments, and the distribution is very uneven.
With eps = 3 and min_samples = 20, DBSCAN was not able to perform well on this dataset. This is due to the overlapping nature of the data, which causes multiple clusters to get grouped together into one large cluster.
Think about it:
Note: You can experiment with different eps and min_sample values to see if DBSCAN produces good distribution and cluster profiles.
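When experimenting with eps, a common starting point (a heuristic, not part of the sample solution) is the k-distance curve: sort each point's distance to its k-th nearest neighbor and look for an "elbow"; an eps near the elbow is a reasonable candidate. Sketched here on synthetic blobs, with k set to the min_samples value used above:

```python
# Hedged sketch of the k-distance heuristic for choosing eps, on synthetic data.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors

X, _ = make_blobs(n_samples=300, centers=3, random_state=1)

k = 20  # conventionally set to min_samples
nn = NearestNeighbors(n_neighbors=k).fit(X)
distances, _ = nn.kneighbors(X)        # shape (n_samples, k)
k_dist = np.sort(distances[:, -1])     # distance to the k-th neighbor, sorted ascending

# Plotting k_dist (e.g. plt.plot(k_dist)) makes the elbow visible;
# eps is usually chosen near that elbow.
print(k_dist[:3], k_dist[-3:])
```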
# Dropping labels we got from DBSCAN since we will be using PCA data for prediction
# Hint: Use axis=1 and inplace=True
data_pca.drop("DBSCAN", axis=1, inplace=True)
data.drop("DBSCAN", axis=1, inplace=True)
Let's find the silhouette score for K=5 in Gaussian Mixture
gmm = GaussianMixture(n_components = 5, random_state = 1) # Initialize Gaussian Mixture Model with number of clusters as 5 and random_state=1
preds = gmm.fit_predict(data_pca) # Fit and predict Gaussian Mixture Model using data_pca
score = silhouette_score(data_pca, preds) # Calculate the silhouette score
print(score) # Print the score
0.1423192906893203
Observations and Insights:
The silhouette score could be improved by lowering the number of components, but we would lose some of the more interesting distinctions that five clusters reveal. For five clusters, K-Means gives a better silhouette score.
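As an optional cross-check on the number of components, GMMs are often compared with the Bayesian Information Criterion (BIC), where lower is better. This is a hedged sketch on synthetic data (a stand-in for `data_pca`), not part of the sample solution:

```python
# Sketch: selecting the GMM component count by BIC on synthetic blob data.
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=500, centers=4, cluster_std=1.0, random_state=1)

# Fit a GMM for each candidate component count and record its BIC
bics = {
    k: GaussianMixture(n_components=k, random_state=1).fit(X).bic(X)
    for k in range(2, 8)
}
best_k = min(bics, key=bics.get)  # lower BIC is better
print(best_k)
```

Unlike the silhouette score, BIC penalizes model complexity, so it can disagree with silhouette-based choices; on the project data, comparing both views is a reasonable sanity check.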
# Predicting on data_pca and adding Gaussian Mixture Model cluster labels to the whole data
data['Gaussian_Mixture'] = gmm.fit_predict(data_pca)
# Predicting on data_pca and adding Gaussian Mixture Model cluster labels to data_model
data_model['Gaussian_Mixture'] = gmm.fit_predict(data_pca)
# Predicting on data_pca and adding Gaussian Mixture Model cluster labels to data_pca
data_pca['Gaussian_Mixture'] = gmm.fit_predict(data_pca)
# Let's check the distribution
data_model["Gaussian_Mixture"].value_counts()
1 737 4 630 0 431 3 369 2 60 Name: Gaussian_Mixture, dtype: int64
Let's visualize the clusters using PCA.
# Hint: Use PCA_PLOT function created above
def PCA_PLOT(X, Y, PCA, cluster):
    # Scatter plot of two PCA components (columns X and Y), colored by cluster label
    sns.scatterplot(x=X, y=Y, data=PCA, hue=cluster)
PCA_PLOT(0, 1, data_pca, "Gaussian_Mixture")
# Take the cluster-wise mean of all the variables. Hint: First group 'data' by cluster labels column and then find mean
cluster_profile_Gaussian_Mixture = data.groupby("Gaussian_Mixture").mean()
# Highlight the maximum average value among all the clusters for each of the variables
cluster_profile_Gaussian_Mixture.style.highlight_max(color="lightgreen", axis=0)
| Gaussian_Mixture | Year_Birth | Income | Kidhome | Teenhome | Recency | MntWines | MntFruits | MntMeatProducts | MntFishProducts | MntSweetProducts | MntGoldProds | NumDealsPurchases | NumWebPurchases | NumCatalogPurchases | NumStorePurchases | NumWebVisitsMonth | AcceptedCmp3 | AcceptedCmp4 | AcceptedCmp5 | AcceptedCmp1 | AcceptedCmp2 | Complain | Response | Age | Kids | Status | Family_Size | Expenses | NumTotalPurchases | Engaged_in_days | TotalAcceptedCmp | AmountPerPurchase |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1968.116009 | 76767.313225 | 0.013921 | 0.132251 | 51.220418 | 620.044084 | 66.976798 | 487.909513 | 101.807425 | 68.612529 | 74.779582 | 1.048724 | 4.955916 | 5.844548 | 8.034803 | 2.301624 | 0.081206 | 0.106729 | 0.264501 | 0.197216 | 0.013921 | 0.004640 | 0.255220 | 47.883991 | 0.146172 | 1.624130 | 1.770302 | 1345.350348 | 19.883991 | 515.719258 | 0.918794 | 68.336971 |
| 1 | 1972.195387 | 31711.922659 | 0.854817 | 0.435550 | 48.537313 | 18.434193 | 3.134328 | 12.312076 | 4.666214 | 3.297151 | 10.364993 | 1.758480 | 1.610583 | 0.303935 | 2.857531 | 6.428765 | 0.075984 | 0.004071 | 0.000000 | 0.000000 | 0.002714 | 0.013569 | 0.080054 | 43.804613 | 1.290366 | 1.644505 | 2.934871 | 41.843962 | 6.530529 | 485.981004 | 0.162822 | 6.243942 |
| 2 | 1968.600000 | 67479.133333 | 0.233333 | 0.366667 | 51.316667 | 769.400000 | 50.250000 | 397.300000 | 57.983333 | 45.466667 | 68.900000 | 1.783333 | 3.716667 | 4.466667 | 5.133333 | 5.750000 | 0.100000 | 0.400000 | 0.483333 | 0.333333 | 0.133333 | 0.000000 | 0.516667 | 47.400000 | 0.600000 | 1.616667 | 2.216667 | 1320.400000 | 15.100000 | 640.566667 | 1.966667 | 112.367329 |
| 3 | 1966.287263 | 63578.964770 | 0.157182 | 0.720867 | 45.607046 | 568.227642 | 47.121951 | 216.214092 | 62.536585 | 50.821138 | 79.525745 | 3.268293 | 6.769648 | 4.783198 | 9.411924 | 5.403794 | 0.089431 | 0.121951 | 0.048780 | 0.081301 | 0.029810 | 0.013550 | 0.176152 | 49.712737 | 0.878049 | 1.628726 | 2.506775 | 944.921409 | 24.233062 | 629.457995 | 0.547425 | 39.293014 |
| 4 | 1967.100000 | 49336.715873 | 0.447619 | 0.738095 | 50.200000 | 227.233333 | 11.347619 | 72.114286 | 15.922222 | 11.206349 | 39.647619 | 3.349206 | 4.912698 | 1.747619 | 5.723810 | 6.020635 | 0.052381 | 0.077778 | 0.001587 | 0.014286 | 0.003175 | 0.004762 | 0.109524 | 48.900000 | 1.185714 | 1.671429 | 2.857143 | 337.823810 | 15.733333 | 551.515873 | 0.258730 | 20.186831 |
Let's plot the boxplots of each variable by cluster.
# Create a boxplot for each of the variables, split by GMM cluster
all_col = col_for_box
plt.figure(figsize=(30, 50))
for i, variable in enumerate(all_col):
    plt.subplot(6, 4, i + 1)
    sns.boxplot(y=data[variable], x=data['Gaussian_Mixture'], showmeans=True)
    plt.tight_layout()
    plt.title(variable)
plt.show()
Cluster 0: High Income no Kids, Middle Age
Summary for cluster 0: This group is made up of 431 customers with the highest income and no kids or teens at home. They have the highest purchases for all food categories other than gold products. They have the lowest number of deals purchases, the highest catalog purchases, and the second highest web and store purchases. They have the lowest number of web visits per month and are second highest for total number of campaigns accepted, with campaign 5 being the one they accepted the most. They have the highest expenses, and are second highest for number of total purchases and amount per purchase. They are second lowest for engaged in days.
Cluster 1: Low Income with Kids, Youngest
Summary for cluster 1: This group is made up of 737 of the youngest customers, with low income and kids and teens in the house. They have the lowest purchases for all the food categories and the lowest web, store, and catalog purchases. They are second lowest for deals purchases and highest for web visits per month. They are lowest for accepting campaigns, with campaign 3 being the one they accepted most. They have the highest complaints, highest number of kids, and largest family size, and are lowest for expenses, total purchases, amount per purchase, and engaged in days.
Cluster 2: High/Middle Income with Teens
Summary for cluster 2: This group is made up of 60 customers with high income and teens in the home. They have the highest recency and the highest wine purchases, coming in second for fruit and meat products. They are second lowest for web and store purchases. This group falls in the middle for many other categories but is highest for accepting all the campaigns and for response. They have the second highest expenses but are second lowest for total number of purchases. They are highest for amount per purchase and engaged in days.
Cluster 3: Middle Income with Teens, Oldest
Summary for cluster 3: This group is made up of 369 of the oldest customers, with middle income and teens in the home. They are the middle group for wine, fruit, and meat products and the second highest for fish and sweet products. They are the highest for gold products and second highest for number of deals purchases and catalog purchases. They are highest for number of web purchases and store purchases, and second lowest for number of web visits per month. They are the middle group for total accepted campaigns, with campaign 4 being the one they accepted most. They are the middle group for expenses and amount per purchase but the top group for number of total purchases. They are second highest for engaged in days.
Cluster 4: Lower/Middle Income with Kids/Teens, Second Oldest
Summary for cluster 4: This group is made up of 630 customers from the lower-middle income class, with the highest number of teens in the home. They are second lowest for all the food categories and the highest users of deals purchases. They are the middle group for web purchases and store purchases. They are second lowest for catalog purchases and second highest for web visits per month. They are the second lowest group for total accepted campaigns, with campaign 4 being the one they accepted most. They are the second oldest group, with the highest status and second highest family size. They are second lowest for expenses and amount per purchase. They are the middle group for number of total purchases and second lowest for engaged in days.
data_pca.drop("Gaussian_Mixture", axis=1, inplace=True)
data.drop("Gaussian_Mixture", axis=1, inplace=True)
1. Comparison of various techniques and their relative performance based on chosen Metric (Measure of success):
In building our models we started with the simplest method, K-means. Figure 3 shows the elbow plot we made to determine the ideal number of clusters; it suggested that 3 or 5 would be a good value for k.
We executed different clustering algorithms with k equal to both 3 and 5 and found that k=5 gave us more insight into our customers' purchasing behavior and marketing preferences. We then calculated the silhouette score for all of our clustering methods, beginning with K-means, where the silhouette score was 0.27 for 3 clusters and 0.24 for 5 clusters. Figure 4 below shows the K-means clusters with k=3, which gave us our best silhouette score for this method. While K-means was a good, computationally inexpensive option, it only showed us basic clusters that lacked complexity.
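The elbow check described above (Figure 3) can be sketched as follows on synthetic stand-in data; in the notebook, `data_pca` would take the place of `X`.

```python
# Sketch of an elbow plot: K-means inertia (within-cluster SSE) for a range
# of k; the "bend" suggests a reasonable number of clusters
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=400, centers=5, random_state=0)

ks = list(range(1, 11))
inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
            for k in ks]

plt.plot(ks, inertias, marker="o")
plt.xlabel("k (number of clusters)")
plt.ylabel("inertia (within-cluster SSE)")
plt.title("Elbow plot")
plt.savefig("elbow.png")
```

Inertia always decreases as k grows, so the point to look for is where the curve flattens rather than the minimum itself.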
Second, we performed K-medoids clustering to see if our data responded better. This method is robust to noise and outliers, and it gave us a silhouette score of 0.11. The clusters were similar to those of K-means but with a lower silhouette score.
For hierarchical clustering, we computed the cophenetic correlation for different distance metrics and linkage methods. Our highest cophenetic correlation was 0.86, obtained with Cityblock distance and average linkage.
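The cophenetic-correlation comparison can be sketched with SciPy on synthetic stand-in data; the notebook's scaled feature matrix would replace `X`.

```python
# Sketch: cophenetic correlation for several distance metric / linkage
# combinations (higher means the dendrogram preserves pairwise distances better)
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))  # stand-in for the scaled customer features

for metric in ("euclidean", "cityblock"):
    dists = pdist(X, metric=metric)  # condensed pairwise distance matrix
    for method in ("single", "complete", "average"):
        Z = linkage(dists, method=method)
        c, _ = cophenet(Z, dists)
        print(f"{metric:10s} {method:8s} cophenetic correlation = {c:.3f}")
```

Ward linkage is omitted from the loop because it is defined for Euclidean distances only.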
We performed our hierarchical clustering with Euclidean distance and Ward linkage, which gave us a silhouette score of 0.25. While this silhouette score was relatively high, it only gave us three clusters, because our dendrograms did not indicate that 5 was advisable.
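A minimal sketch of the Ward-linkage step, scored with the silhouette on synthetic data standing in for the notebook's `data_pca`:

```python
# Sketch: agglomerative clustering with Ward linkage, scored by silhouette
from sklearn.datasets import make_blobs
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=3, random_state=2)
labels = AgglomerativeClustering(n_clusters=3, linkage="ward").fit_predict(X)
print(f"silhouette = {silhouette_score(X, labels):.3f}")
```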
When we performed DBSCAN, we found our best silhouette score with eps=3 and min_samples=20. The silhouette score for this method was 0.34, our best so far; however, it divided our data into just 2 clusters, which would not give us enough information about our customers and was also the reason the silhouette score was so high. The issue with using DBSCAN here is that our clusters overlap, so multiple clusters got grouped together into one large cluster. We changed the hyperparameters to see if we could get better results, without success.
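The DBSCAN run with the hyperparameters quoted above can be sketched on synthetic stand-in data. Note that DBSCAN may label some points as noise (-1), and those must be excluded before computing the silhouette, which also requires at least two clusters.

```python
# Sketch: DBSCAN with eps=3, min_samples=20, silhouette on non-noise points
from sklearn.datasets import make_blobs
from sklearn.cluster import DBSCAN
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=500, centers=[[0, 0], [10, 10]],
                  cluster_std=0.5, random_state=3)
labels = DBSCAN(eps=3, min_samples=20).fit_predict(X)

mask = labels != -1  # drop noise points; silhouette is undefined for them
n_clusters = len(set(labels[mask]))
if n_clusters >= 2:
    print(f"{n_clusters} clusters, "
          f"silhouette = {silhouette_score(X[mask], labels[mask]):.3f}")
else:
    print("fewer than 2 clusters found; silhouette undefined")
```

With well-separated blobs like these, DBSCAN finds two dense clusters and the silhouette is high, mirroring the finding above that a coarse two-cluster split inflates the score.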
Lastly, we used the Gaussian Mixture Model with k=5 and got a silhouette score of 0.14. With k equal to 5, this gave us the best silhouette score after K-means, outperforming K-medoids.
Of the clustering methods that divided our data into 5 clusters, K-means performed the best. We discovered early on that 5 clusters would give us the most insight into our customers' buying habits, and this remained true throughout the analysis.
By implementing unsupervised learning methods such as dimensionality reduction and clustering, we were better able to identify patterns in our client's dataset of customers. To better understand the customers and increase ROI, we used customer segmentation to group them into like clusters. This helped identify the customers' preferred methods of purchasing and maximized the client's marketing potential.
To achieve this goal, we performed data analysis and dimensionality reduction that gave us insight into the customers' defining attributes. When we began exploring our clustering methods, each algorithm's success was judged by the rank of its silhouette score. We chose the method with the highest silhouette score to ensure that our clusters would be grouped based on their cohesion relative to other clusters. Our most successful clustering method, one that gave insight into purchasing patterns while also providing diverse customer profiles, was K-means with k equal to 5, which gave us a silhouette score of 0.24. This method gave us five distinct clusters based on income being high, middle, or low, and then separated those classes further according to age and family size. By implementing this clustering method, we were able to identify patterns in the customer dataset that allow for more effective marketing campaigns, thus increasing productivity and sales.
The model that will produce the best results for the business and the marketing team specifically is K-Means with k=5. These clusters are clearly defined according to income, family size, age, purchasing habits, and campaign interactions. We can also make some valuable suggestions based on the trends in the clusters as to how to treat each group going forward.
While some of the other methods would be interesting to explore, cloud services can get very expensive with large datasets. While our dataset is not particularly large right now it will continue to grow as we add new data and begin to gain new customers.
We can confidently proceed with the implementation of our model, knowing that the clusters we created give us the best possible outcome based on our carefully chosen segmentation attributes and the method we used to obtain them. We can draw viable conclusions about our clusters with regard to buying habits and how receptive each group will be to various types of advertising.